Thread: design for parallel backup


design for parallel backup

From: Robert Haas

Hi,

Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
there's a proposal for a parallel backup patch which works in the way
that I have always thought parallel backup would work: instead of
having a monolithic command that returns a series of tarballs, you
request individual files from a pool of workers. Leaving aside the
quality-of-implementation issues in that patch set, I'm starting to
think that the design is fundamentally wrong and that we should take a
whole different approach. The problem I see is that it makes a
parallel backup and a non-parallel backup work very differently, and
I'm starting to realize that there are good reasons why you might want
them to be similar.

Specifically, as Andres recently pointed out[1], almost anything that
you might want to do on the client side, you might also want to do on
the server side. We already have an option to let the client compress
each tarball, but you might also want the server to, say, compress
each tarball[2]. Similarly, you might want either the client or the
server to be able to encrypt each tarball, or compress but with a
different compression algorithm than gzip. If, as is presently the
case, the server is always returning a set of tarballs, it's pretty
easy to see how to make this work in the same way on either the client
or the server, but if the server returns a set of tarballs in
non-parallel backup cases, and a set of tarballs in parallel backup
cases, it's a lot harder to see how that any sort of server-side
processing should work, or how the same mechanism could be used on
either the client side or the server side.

So, my new idea for parallel backup is that the server will return
tarballs, but just more of them. Right now, you get base.tar and
${tablespace_oid}.tar for each tablespace. I propose that if you do a
parallel backup, you should get base-${N}.tar and
${tablespace_oid}-${N}.tar for some or all values of N between 1 and
the number of workers, with the server deciding which files ought to
go in which tarballs. This is more or less the naming convention that
BART uses for its parallel backup implementation, which, incidentally,
I did not write. I don't really care if we pick something else, but it
seems like a sensible choice. The reason why I say "some or all" is
that some workers might not get any of the data for a given
tablespace. In fact, it's probably desirable to have different workers
work on different tablespaces as far as possible, to maximize parallel
I/O, but it's quite likely that you will have more workers than
tablespaces. So you might end up, with pg_basebackup -j4, having the
server send you base-1.tar and base-2.tar and base-4.tar, but not
base-3.tar, because worker 3 spent all of its time on user-defined
tablespaces, or was just out to lunch.

Now, if you use -Fp, those tar files are just going to get extracted
anyway by pg_basebackup itself, so you won't even know they exist.
However, if you use -Ft, you're going to end up with more files than
before. This seems like something of a wart, because you wouldn't
necessarily expect that the set of output files produced by a backup
would depend on the degree of parallelism used to take it. However,
I'm not sure I see a reasonable alternative. The client could try to
glue all of the related tar files sent by the server together into one
big tarfile, but that seems like it would slow down the process of
writing the backup by forcing the different server connections to
compete for the right to write to the same file. Moreover, if you end
up needing to restore the backup, having a bunch of smaller tar files
instead of one big one means you can try to untar them in parallel if
you like, so it seems not impossible that it could be advantageous to
have them split in that case as well.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/20200412191702.ul7ohgv5gus3tsvo@alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20190823172637.GA16436%40tamriel.snowman.net



Re: design for parallel backup

From: Amit Kapila

On Wed, Apr 15, 2020 at 9:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.
>
> Specifically, as Andres recently pointed out[1], almost anything that
> you might want to do on the client side, you might also want to do on
> the server side. We already have an option to let the client compress
> each tarball, but you might also want the server to, say, compress
> each tarball[2]. Similarly, you might want either the client or the
> server to be able to encrypt each tarball, or compress but with a
> different compression algorithm than gzip. If, as is presently the
> case, the server is always returning a set of tarballs, it's pretty
> easy to see how to make this work in the same way on either the client
> or the server, but if the server returns a set of tarballs in
> non-parallel backup cases and a bunch of individual files in parallel
> backup cases, it's a lot harder to see how any sort of server-side
> processing should work, or how the same mechanism could be used on
> either the client side or the server side.
>
> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.
>

It is not apparent how you are envisioning this division on the
server side.  I think in the currently proposed patch, each worker on
the client side requests specific files. So, how are workers going
to request such numbered files, and how will we ensure that the work
division among workers is fair?

> This is more or less the naming convention that
> BART uses for its parallel backup implementation, which, incidentally,
> I did not write. I don't really care if we pick something else, but it
> seems like a sensible choice. The reason why I say "some or all" is
> that some workers might not get any of the data for a given
> tablespace. In fact, it's probably desirable to have different workers
> work on different tablespaces as far as possible, to maximize parallel
> I/O, but it's quite likely that you will have more workers than
> tablespaces. So you might end up, with pg_basebackup -j4, having the
> server send you base-1.tar and base-2.tar and base-4.tar, but not
> base-3.tar, because worker 3 spent all of its time on user-defined
> tablespaces, or was just out to lunch.
>
> Now, if you use -Fp, those tar files are just going to get extracted
> anyway by pg_basebackup itself, so you won't even know they exist.
> However, if you use -Ft, you're going to end up with more files than
> before. This seems like something of a wart, because you wouldn't
> necessarily expect that the set of output files produced by a backup
> would depend on the degree of parallelism used to take it. However,
> I'm not sure I see a reasonable alternative. The client could try to
> glue all of the related tar files sent by the server together into one
> big tarfile, but that seems like it would slow down the process of
> writing the backup by forcing the different server connections to
> compete for the right to write to the same file.
>

I think it also depends to some extent on what we decide in the nearby
thread [1] related to support for compression/encryption.  Say, if we
want to support a new compression method on the client side, then we
need to process the contents of each tar file anyway, in which case
combining them into a single tar file might be okay, but I am not sure
what the right thing is here. I think this part needs some more thought.

[1] -
https://www.postgresql.org/message-id/CA%2BTgmoYr7%2B-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF%3DkUqkG%3DFw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: design for parallel backup

From: Robert Haas

On Mon, Apr 20, 2020 at 8:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It is not apparent how you are envisioning this division on the
> server side.  I think in the currently proposed patch, each worker on
> the client side requests specific files. So, how are workers going
> to request such numbered files, and how will we ensure that the work
> division among workers is fair?

I think that the workers would just say "give me my share of the base
backup" and then the server would divide up the files as it went. It
would probably keep a queue of whatever files still need to be
processed in shared memory and each process would pop items from the
queue to send to its client.
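
A minimal sketch of that queue, using PostgreSQL's spinlock primitives
(all names here are hypothetical; this is not code from any actual
patch):

/* Hypothetical shared-memory state for handing out files to workers. */
#include "storage/spin.h"

typedef struct ParallelBackupState
{
    slock_t     mutex;          /* protects next_file */
    int         next_file;      /* index of next file to hand out */
    int         num_files;      /* total number of files to send */
} ParallelBackupState;

/*
 * Each worker backend pops the next file index; -1 means no work is
 * left, so the worker can finish its tarball and end its stream.
 */
static int
next_backup_file(ParallelBackupState *state)
{
    int         result = -1;

    SpinLockAcquire(&state->mutex);
    if (state->next_file < state->num_files)
        result = state->next_file++;
    SpinLockRelease(&state->mutex);
    return result;
}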

> I think it also depends to some extent on what we decide in the nearby
> thread [1] related to support for compression/encryption.  Say, if we
> want to support a new compression method on the client side, then we
> need to process the contents of each tar file anyway, in which case
> combining them into a single tar file might be okay, but I am not sure
> what the right thing is here. I think this part needs some more thought.

Yes, it needs more thought, but the central idea is to try to create
something that is composable. For example, if we have code to do LZ4
compression, and code to do GPG encryption, then we should be able to
do both without adding any more code. Ideally, we should also be able
to perform either of those operations on either the client side or the
server side, using the same code either way.
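
To illustrate the sort of composability I have in mind, here is a rough
sketch of a chained-transform interface (entirely hypothetical, not an
existing or proposed API):

/* Hypothetical chained transform; each stage feeds the next. */
typedef struct BackupFilter BackupFilter;

struct BackupFilter
{
    /* transform a chunk and push the result to the next stage */
    void        (*process) (BackupFilter *self, const char *data, size_t len);
    /* flush any buffered state at end of stream */
    void        (*finish) (BackupFilter *self);
    BackupFilter *next;         /* next stage, or the final sink */
    void       *state;          /* e.g. an LZ4 or GPG context */
};

/*
 * The same stages could then be chained on the server
 * (tar -> lz4 -> gpg -> network) or on the client
 * (network -> gpg -> lz4 -> extract), reusing identical code.
 */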

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From: Peter Eisentraut

On 2020-04-15 17:57, Robert Haas wrote:
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.

That would clearly be a good goal.  Non-parallel backup should ideally 
be parallel backup with one worker.

But it doesn't follow that the proposed design is wrong.  It might just 
be that the design of the existing backup should change.

I think making the wire format so heavily tied to the tar format is 
dubious.  There is nothing particularly fabulous about the tar format. 
If the server just sends a bunch of files with metadata for each file, 
the client can assemble them in any way it wants: unpacked, packed in 
several tarballs like now, packed all in one tarball, packed in a zip 
file, sent to S3, etc.

Another thing I would like to see sometime is this: Pull a minimal 
basebackup, start recovery and possibly hot standby before you have 
received all the files.  When you need to access a file that's not there 
yet, request that as a priority from the server.  If you nudge the file 
order a little with perhaps prewarm-like data, you could get a mostly 
functional standby without having to wait for the full basebackup to 
finish.  Pull a file on request is a requirement for this.

> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.

I understand the other side of this:  Why not compress or encrypt the 
backup already on the server side?  Makes sense.  But this way seems 
weird and complicated.  If I want a backup, I want one file, not an 
unpredictable set of files.  How do I even know I have them all?  Do we 
need a meta-manifest?

A format such as ZIP would offer more flexibility, I think.  You can 
build a single target file incrementally, you can compress or encrypt 
each member file separately, thus allowing some compression etc. on the 
server.  I'm not saying it's perfect for this, but some more thinking 
about the archive formats would potentially give some possibilities.

All things considered, we'll probably want more options and more ways of 
doing things.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-15 11:57:29 -0400, Robert Haas wrote:
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.
>
> Specifically, as Andres recently pointed out[1], almost anything that
> you might want to do on the client side, you might also want to do on
> the server side. We already have an option to let the client compress
> each tarball, but you might also want the server to, say, compress
> each tarball[2]. Similarly, you might want either the client or the
> server to be able to encrypt each tarball, or compress but with a
> different compression algorithm than gzip. If, as is presently the
> case, the server is always returning a set of tarballs, it's pretty
> easy to see how to make this work in the same way on either the client
> or the server, but if the server returns a set of tarballs in
> non-parallel backup cases and a bunch of individual files in parallel
> backup cases, it's a lot harder to see how any sort of server-side
> processing should work, or how the same mechanism could be used on
> either the client side or the server side.
>
> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs. This is more or less the naming convention that
> BART uses for its parallel backup implementation, which, incidentally,
> I did not write. I don't really care if we pick something else, but it
> seems like a sensible choice. The reason why I say "some or all" is
> that some workers might not get any of the data for a given
> tablespace. In fact, it's probably desirable to have different workers
> work on different tablespaces as far as possible, to maximize parallel
> I/O, but it's quite likely that you will have more workers than
> tablespaces. So you might end up, with pg_basebackup -j4, having the
> server send you base-1.tar and base-2.tar and base-4.tar, but not
> base-3.tar, because worker 3 spent all of its time on user-defined
> tablespaces, or was just out to lunch.

One question I have not really seen answered well:

Why do we want parallelism here. Or to be more precise: What do we hope
to accelerate by making what part of creating a base backup
parallel. There's several potential bottlenecks, and I think it's
important to know the design priorities to evaluate a potential design.

Bottlenecks (not ordered by importance):
- compression performance (likely best solved by multiple compression
  threads and a better compression algorithm)
- unencrypted network performance (I'd like to see benchmarks showing in
  which cases multiple TCP streams help / at which bandwidth it starts
  to help)
- encrypted network performance, i.e. SSL overhead (not sure this is an
  important problem on modern hardware, given hardware accelerated AES)
- checksumming overhead (a serious problem for cryptographic checksums,
  but presumably not for others)
- file IO (presumably multiple facets here, number of concurrent
  in-flight IOs, kernel page cache overhead when reading TBs of data)

I'm not really convinced that a design addressing the more crucial
bottlenecks really needs multiple fe/be connections. But that seems to
have been the focus of the discussion so far.

Greetings,

Andres Freund



Re: design for parallel backup

From: Robert Haas

Thanks for your thoughts.

On Mon, Apr 20, 2020 at 4:02 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> That would clearly be a good goal.  Non-parallel backup should ideally
> be parallel backup with one worker.

Right.

> But it doesn't follow that the proposed design is wrong.  It might just
> be that the design of the existing backup should change.
>
> I think making the wire format so heavily tied to the tar format is
> dubious.  There is nothing particularly fabulous about the tar format.
> If the server just sends a bunch of files with metadata for each file,
> the client can assemble them in any way it wants: unpacked, packed in
> several tarballs like now, packed all in one tarball, packed in a zip
> file, sent to S3, etc.

Yeah, that's true, and I agree that there's something a little
unsatisfying and dubious about the current approach. However, I am not
sure that there is sufficient reason to change it to something else,
either. After all, what purpose would such a change serve? The client
can already do any of the things you mention here, provided that it
can interpret the data sent by the server, and pg_basebackup already
has code to do exactly this. Right now, we have pretty good
pg_basebackup compatibility across server versions, and if we change
the format, then we won't, unless we make both the client and the
server understand both formats. I'm not completely averse to such a
change if it has sufficient benefits to make it worthwhile, but it's
not clear to me that it does.

> Another thing I would like to see sometime is this: Pull a minimal
> basebackup, start recovery and possibly hot standby before you have
> received all the files.  When you need to access a file that's not there
> yet, request that as a priority from the server.  If you nudge the file
> order a little with perhaps prewarm-like data, you could get a mostly
> functional standby without having to wait for the full basebackup to
> finish.  Pull a file on request is a requirement for this.

True, but that can always be implemented as a separate feature. I
won't be sad if that feature happens to fall out of work in this area,
but I don't think the possibility that we'll some day have such
advanced wizardry should bias the design of this feature very much.
One pretty major problem with this is that you can't open for
connections until you've reached a consistent state, and you can't say
that you're in a consistent state until you've replayed all the WAL
generated during the backup, and you can't say that you're at the end
of the backup until you've copied all the files. So, without some
clever idea, this would only allow you to begin replay sooner; it
would not allow you to accept connections sooner. I suspect that makes
it significantly less appealing.

> > So, my new idea for parallel backup is that the server will return
> > tarballs, but just more of them. Right now, you get base.tar and
> > ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> > parallel backup, you should get base-${N}.tar and
> > ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> > the number of workers, with the server deciding which files ought to
> > go in which tarballs.
>
> I understand the other side of this:  Why not compress or encrypt the
> backup already on the server side?  Makes sense.  But this way seems
> weird and complicated.  If I want a backup, I want one file, not an
> unpredictable set of files.  How do I even know I have them all?  Do we
> need a meta-manifest?

Yes, that's a problem, but...

> A format such as ZIP would offer more flexibility, I think.  You can
> build a single target file incrementally, you can compress or encrypt
> each member file separately, thus allowing some compression etc. on the
> server.  I'm not saying it's perfect for this, but some more thinking
> about the archive formats would potentially give some possibilities.

...I don't think this really solves anything. I expect you would have
to write the file more or less sequentially, and I think that Amdahl's
law will not be kind to us.

> All things considered, we'll probably want more options and more ways of
> doing things.

Yes. That's why I'm trying to figure out how to create a flexible framework.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From: Robert Haas

On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.
>
> Bottlenecks (not ordered by importance):
> - compression performance (likely best solved by multiple compression
>   threads and a better compression algorithm)
> - unencrypted network performance (I'd like to see benchmarks showing in
>   which cases multiple TCP streams help / at which bandwidth it starts
>   to help)
> - encrypted network performance, i.e. SSL overhead (not sure this is an
>   important problem on modern hardware, given hardware accelerated AES)
> - checksumming overhead (a serious problem for cryptographic checksums,
>   but presumably not for others)
> - file IO (presumably multiple facets here, number of concurrent
>   in-flight IOs, kernel page cache overhead when reading TBs of data)
>
> I'm not really convinced that a design addressing the more crucial
> bottlenecks really needs multiple fe/be connections. But that seems to
> have been the focus of the discussion so far.

I haven't evaluated this. Both BART and pgBackRest offer parallel
backup options, and I'm pretty sure both were performance tested and
found to be very significantly faster, but I didn't write the code for
either, nor have I evaluated either to figure out exactly why it was
faster.

My suspicion is that it has mostly to do with adequately utilizing the
hardware resources on the server side. If you are network-constrained,
adding more connections won't help, unless there's something shaping
the traffic which can be gamed by having multiple connections.
However, as things stand today, at any given point in time the base
backup code on the server will EITHER be attempting a single
filesystem I/O or a single network I/O, and likewise for the client.
If a backup client - either current or hypothetical - is compressing
and encrypting, then it doesn't have either a filesystem I/O or a
network I/O in progress while it's doing so. You take not only the hit
of the time required for compression and/or encryption, but also use
that much less of the available network and/or I/O capacity.

While I agree that some of these problems could likely be addressed in
other ways, parallelism seems to offer an approach that could solve
multiple issues at the same time. If you want to address it without
that, you need asynchronous filesystem I/O and asynchronous network
I/O and both of those on both the client and server side, plus
multithreaded compression and multithreaded encryption and maybe some
other things. That sounds pretty hairy and hard to get right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
> My suspicion is that it has mostly to do with adequately utilizing the
> hardware resources on the server side. If you are network-constrained,
> adding more connections won't help, unless there's something shaping
> the traffic which can be gamed by having multiple connections.
> However, as things stand today, at any given point in time the base
> backup code on the server will EITHER be attempting a single
> filesystem I/O or a single network I/O, and likewise for the client.

Well, kinda, but not really. Both file reads (server)/writes(client) and
network send(server)/recv(client) are buffered by the OS, and the file
IO is entirely sequential.

That's not true for checksum computations / compressions to the same
degree. They're largely bottlenecked in userland, without the kernel
doing as much async work.


> If a backup client - either current or hypothetical - is compressing
> and encrypting, then it doesn't have either a filesystem I/O or a
> network I/O in progress while it's doing so. You take not only the hit
> of the time required for compression and/or encryption, but also use
> that much less of the available network and/or I/O capacity.

I don't think it's really the time for network/file I/O that's the
issue. Sure memcpy()'ing from the kernel takes time, but compared to
encryption/compression it's not that much.  Especially for compression,
it's not really lack of cycles for networking that prevent a higher
throughput, it's that after buffering a few MB there's just no point
buffering more, given compression will plod along with 20-100MB/s.


> While I agree that some of these problems could likely be addressed in
> other ways, parallelism seems to offer an approach that could solve
> multiple issues at the same time. If you want to address it without
> that, you need asynchronous filesystem I/O and asynchronous network
> I/O and both of those on both the client and server side, plus
> multithreaded compression and multithreaded encryption and maybe some
> other things. That sounds pretty hairy and hard to get right.

I'm not really convinced. You're complicating the wire protocol by
having multiple tar files with overlapping contents. With the
consequence that clients need additional logic to deal with that. We'll
not get one manifest, but multiple ones, etc.

We already do network IO non-blocking and leave the copying to the
kernel, which does the actual network work asynchronously. Except
for file boundaries the kernel does asynchronous read IO for us (but we
should probably hint it to do that even at the start of a new file).
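
Such a hint could be as simple as this sketch, using the standard
posix_fadvise() interface (illustrative only):

#include <fcntl.h>

/* Tell the kernel we will read this file sequentially, so readahead
 * can start immediately instead of after the first few reads. */
static int
open_for_streaming(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd >= 0)
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return fd;
}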

I think we're quite a bit away from where we need to worry about making
encryption multi-threaded:
andres@awork3:~/src/postgresql$ openssl speed -evp aes-256-ctr
Doing aes-256-ctr for 3s on 16 size blocks: 81878709 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 64 size blocks: 71062203 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 256 size blocks: 31738391 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 1024 size blocks: 10043519 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 8192 size blocks: 1346933 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 16384 size blocks: 674680 aes-256-ctr's in 3.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Tue Mar 31 21:59:59 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2
-fdebug-prefix-map=/build/openssl-hsg853/openssl-1.1.1f=. -fstack-protector-strong -Wformat -Werror=format-security
-DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT
-DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM
-DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time
-D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-ctr     436686.45k  1515993.66k  2708342.70k  3428187.82k  3678025.05k  3684652.37k


So that really just leaves compression (and perhaps cryptographic
checksumming). Given that we can provide nearly all of the benefits of
multi-stream parallelism in a compatible way by using
parallelism/threads at that level, I just have a hard time believing the
complexity of doing those tasks in parallel is bigger than multi-stream
parallelism. And I'd be fairly unsurprised if you'd end up with a lot
more "bubbles" in the pipeline when using multi-stream parallelism.
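
For what it's worth, zstd already exposes exactly that kind of
intra-stream parallelism. A minimal sketch, assuming a libzstd built
with multithreading support:

#include <zstd.h>

/* Compress one buffer using nworkers threads within a single stream. */
static size_t
compress_mt(void *dst, size_t dstcap, const void *src, size_t srclen,
            int level, int nworkers)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      n;

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);

    /* one-shot for brevity; a real consumer would use the streaming
     * interface (ZSTD_compressStream2) chunk by chunk */
    n = ZSTD_compress2(cctx, dst, dstcap, src, srclen);

    ZSTD_freeCCtx(cctx);
    return n;                   /* check with ZSTD_isError() */
}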

Greetings,

Andres Freund



Re: design for parallel backup

From: Amit Kapila

On Tue, Apr 21, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
>
> > If a backup client - either current or hypothetical - is compressing
> > and encrypting, then it doesn't have either a filesystem I/O or a
> > network I/O in progress while it's doing so. You take not only the hit
> > of the time required for compression and/or encryption, but also use
> > that much less of the available network and/or I/O capacity.
>
> I don't think it's really the time for network/file I/O that's the
> issue. Sure memcpy()'ing from the kernel takes time, but compared to
> encryption/compression it's not that much.  Especially for compression,
> it's not really lack of cycles for networking that prevent a higher
> throughput, it's that after buffering a few MB there's just no point
> buffering more, given compression will plod along with 20-100MB/s.
>

It is quite likely that compression can benefit more from parallelism
as compared to the network I/O as that is mostly a CPU intensive
operation but I am not sure if we can just ignore the benefit of
utilizing the network bandwidth.  In our case, after copying from the
network we do write that data to disk, so during filesystem I/O the
network can be used if there is some other parallel worker processing
other parts of data.

Also, there may be some users who don't want their data to be
compressed for some reason, such as the overhead of decompression
being so high that restore takes more time, which they are not
comfortable with because for them a faster restore is much more
critical than a compressed or fast backup.  So, for such cases, the
parallelism during backup being discussed in this thread will still
be helpful.

OTOH, I think without some measurements it is difficult to say that we
have a significant benefit by parallelizing the backup without compression.
I have scanned the other thread [1] where the patch for parallel
backup was discussed and didn't find any performance numbers, so
probably having some performance data with that patch might give us a
better understanding of introducing parallelism in the backup.

[1] - https://www.postgresql.org/message-id/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-21 10:20:01 +0530, Amit Kapila wrote:
> It is quite likely that compression can benefit more from parallelism
> as compared to the network I/O as that is mostly a CPU intensive
> operation but I am not sure if we can just ignore the benefit of
> utilizing the network bandwidth.  In our case, after copying from the
> network we do write that data to disk, so during filesystem I/O the
> network can be used if there is some other parallel worker processing
> other parts of data.

Well, as I said, network and FS IO as done by server / pg_basebackup are
both fully buffered by the OS. Unless the OS throttles the userland
process, a large chunk of the work will be done by the kernel, in
separate kernel threads.

My workstation and my laptop can, in a single thread each, get close to
20GBit/s of network IO (bidirectional 10GBit, I don't have faster - it's
a thunderbolt 10gbe card) and iperf3 is at 55% CPU while doing so. Just
connecting locally it's 45Gbit/s. Or over 8GByte/s of buffered
filesystem IO. And it doesn't even have that high per-core clock speed.

I just don't see this being the bottleneck for now.


> Also, there may be some users who don't want their data to be
> compressed for some reason, such as the overhead of decompression
> being so high that restore takes more time, which they are not
> comfortable with because for them a faster restore is much more
> critical than a compressed or fast backup.  So, for such cases, the
> parallelism during backup being discussed in this thread will still
> be helpful.

I am not even convinced it'll be helpful in a large fraction of
cases. The added overhead of more connections / processes isn't free.

I believe there are some cases where it'd help. E.g. if there are
multiple tablespaces on independent storage, parallelism as described
here could lead to significantly better utilization of the different
tablespaces. But that'd require dividing the work between processes
appropriately.


> OTOH, I think without some measurements it is difficult to say that we
> have a significant benefit by parallelizing the backup without compression.
> I have scanned the other thread [1] where the patch for parallel
> backup was discussed and didn't find any performance numbers, so
> probably having some performance data with that patch might give us a
> better understanding of introducing parallelism in the backup.

Agreed, we need some numbers.

Greetings,

Andres Freund



Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-20 22:31:49 -0700, Andres Freund wrote:
> On 2020-04-21 10:20:01 +0530, Amit Kapila wrote:
> > It is quite likely that compression can benefit more from parallelism
> > as compared to the network I/O as that is mostly a CPU intensive
> > operation but I am not sure if we can just ignore the benefit of
> > utilizing the network bandwidth.  In our case, after copying from the
> > network we do write that data to disk, so during filesystem I/O the
> > network can be used if there is some other parallel worker processing
> > other parts of data.
>
> Well, as I said, network and FS IO as done by server / pg_basebackup are
> both fully buffered by the OS. Unless the OS throttles the userland
> process, a large chunk of the work will be done by the kernel, in
> separate kernel threads.
>
> My workstation and my laptop can, in a single thread each, get close to
> 20GBit/s of network IO (bidirectional 10GBit, I don't have faster - it's
> a thunderbolt 10gbe card) and iperf3 is at 55% CPU while doing so. Just
> connecting locally it's 45Gbit/s. Or over 8GByte/s of buffered
> filesystem IO. And it doesn't even have that high per-core clock speed.
>
> I just don't see this being the bottleneck for now.

FWIW, I just tested pg_basebackup locally.

Without compression and a stock postgres I get:
unix                tcp                  tcp+ssl:
1.74GiB/s           1.02GiB/s            699MiB/s

That turns out to be bottlenecked by the backup manifest generation.

Without compression, a stock postgres, and --no-manifest I get:
unix                tcp                  tcp+ssl:
2.51GiB/s           1.63GiB/s            1.00GiB/s

I.e. all of them are already above 10Gbit/s network speed.

Looking at a profile it's clear that our small output buffer is the
bottleneck:
64kB Buffers + --no-manifest:
unix                tcp                  tcp+ssl:
2.99GiB/s           2.56GiB/s            1.18GiB/s

At this point the backend is not actually the bottleneck anymore,
instead it's pg_basebackup. Which is in part due to the small buffer
used for output data (i.e. libc's FILE buffering), and in part because
we spend too much time memmove()ing data, because of the "left-justify"
logic in pqCheckInBufferSpace().


- Andres



Re: design for parallel backup

From: Robert Haas

On Tue, Apr 21, 2020 at 2:44 AM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I just tested pg_basebackup locally.
>
> Without compression and a stock postgres I get:
> unix                tcp                  tcp+ssl:
> 1.74GiB/s           1.02GiB/s            699MiB/s
>
> That turns out to be bottlenecked by the backup manifest generation.

Whoa. That's unexpected, at least for me. Is that because of the
CRC-32C overhead, or something else? What do you get with
--manifest-checksums=none?

> Without compression, a stock postgres, and --no-manifest I get:
> unix                tcp                  tcp+ssl:
> 2.51GiB/s           1.63GiB/s            1.00GiB/s
>
> I.e. all of them are already above 10Gbit/s network speed.
>
> Looking at a profile it's clear that our small output buffer is the
> bottleneck:
> 64kB Buffers + --no-manifest:
> unix                tcp                  tcp+ssl:
> 2.99GiB/s           2.56GiB/s            1.18GiB/s
>
> At this point the backend is not actually the bottleneck anymore,
> instead it's pg_basebackup. Which is in part due to the small buffer
> used for output data (i.e. libc's FILE buffering), and in part because
> we spend too much time memmove()ing data, because of the "left-justify"
> logic in pqCheckInBufferSpace().

Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-21 07:18:20 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 2:44 AM Andres Freund <andres@anarazel.de> wrote:
> > FWIW, I just tested pg_basebackup locally.
> >
> > Without compression and a stock postgres I get:
> > unix                tcp                  tcp+ssl:
> > 1.74GiB/s           1.02GiB/s            699MiB/s
> >
> > That turns out to be bottlenecked by the backup manifest generation.
> 
> Whoa. That's unexpected, at least for me. Is that because of the
> CRC-32C overhead, or something else? What do you get with
> --manifest-checksums=none?

It's all CRC overhead. I don't see a difference with
--manifest-checksums=none anymore. We really should look for a better
"fast" checksum.

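For comparison, the hardware CRC-32C instruction runs at several bytes
per cycle; a minimal sketch of such a loop (requires SSE4.2, compile
with -msse4.2 -- PostgreSQL's pg_crc32c SSE4.2 path works along these
lines):

#include <nmmintrin.h>
#include <stdint.h>
#include <string.h>

static uint32_t
crc32c_sse42(const unsigned char *p, size_t len)
{
    uint64_t    crc = 0xFFFFFFFF;

    while (len >= 8)
    {
        uint64_t    v;

        memcpy(&v, p, 8);       /* avoid unaligned loads */
        crc = _mm_crc32_u64(crc, v);
        p += 8;
        len -= 8;
    }
    while (len-- > 0)
        crc = _mm_crc32_u8((uint32_t) crc, *p++);
    return (uint32_t) crc ^ 0xFFFFFFFF;
}
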
Regards,

Andres



Re: design for parallel backup

From: Robert Haas

On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres@anarazel.de> wrote:
> It's all CRC overhead. I don't see a difference with
> --manifest-checksums=none anymore. We really should look for a better
> "fast" checksum.

Hmm, OK. I'm wondering exactly what you tested here. Was this over
your 20Gbit/s connection between laptop and workstation, or was this
local TCP? Also, was the database being read from persistent storage,
or was it RAM-cached? How do you expect to take advantage of I/O
parallelism without multiple processes/connections?

Meanwhile, I did some local-only testing on my new 16GB MacBook Pro
laptop with all combinations of:

- UNIX socket, local TCP socket, local TCP socket with SSL
- Plain format, tar format, tar format with gzip
- No manifest ("omit"), manifest with no checksums, manifest with
CRC-32C checksums, manifest with SHA256 checksums.

The database is a fresh scale-factor 1000 pgbench database. No
concurrent database load. Observations:

- UNIX socket was slower than a local TCP socket, and about the same
speed as a TCP socket with SSL.
- CRC-32C is about 10% slower than no manifest and/or no checksums in
the manifest. SHA256 is 1.5-2x slower, but less when compression is
also used (see below).
- Plain format is a little slower than tar format; tar with gzip is
typically >~5x slower, but less when the checksum algorithm is SHA256
(again, see below).
- SHA256 + tar format with gzip is the slowest combination, but it's
"only" about 15% slower than no manifest, and about 3.3x slower than
no compression, presumably because the checksumming is slowing down
the server and the compression is slowing down the client.
- Fastest speeds I see in any test are ~650MB/s, and slowest are
~65MB/s, obviously benefiting greatly from the fact that this is a
local-only test.
- The time for a raw cp -R of the backup directory is about 10s, and
the fastest time to take a backup (tcp+tar+m:omit) is about 22s.
- In all cases I've checked so far both pg_basebackup and the server
backend are pegged at 98-100% CPU usage. I haven't looked into where
that time is going yet.

Full results and test script attached. I and/or my colleagues will try
to test out some other environments, but I'm not sure we have easy
access to anything as high-powered as a 20Gbit/s interconnect.

It seems to me that the interesting cases may involve having lots of
available CPUs and lots of disk spindles, but a comparatively slow
pipe between the machines. I mean, if it takes 36 hours to read the
data from disk, you can't realistically expect to complete a full
backup in less than 36 hours. Incremental backup might help, but
otherwise you're just dead. On the other hand, if you can read the
data from the disk in 2 hours but it takes 36 hours to complete a
backup, it seems like you have more justification for thinking that
the backup software could perhaps do better. In such cases efficient
server-side compression may help a lot, but even then, I wonder
whether you can read the data at maximum speed with only a single
process? I tend to doubt it, but I guess you only have to be fast
enough to saturate the network. Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-21 14:01:28 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres@anarazel.de> wrote:
> > It's all CRC overhead. I don't see a difference with
> > --manifest-checksums=none anymore. We really should look for a better
> > "fast" checksum.
>
> Hmm, OK. I'm wondering exactly what you tested here. Was this over
> your 20Gbit/s connection between laptop and workstation, or was this
> local TCP?

It was local TCP. The speeds I can reach are faster than the 10Gbit/s
(unidirectional) I can do between the laptop & workstation, so testing
it over "actual" network isn't informative - I basically can reach line
speed between them with any method.


> Also, was the database being read from persistent storage, or was it
> RAM-cached?

It was in kernel buffer cache. But I can reach 100% utilization of
storage too (which is slightly slower than what I can do over unix
socket).

pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
2.59GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
2.53GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
2.42GiB/s

I tested this with a -s 5000 DB, FWIW.


> How do you expect to take advantage of I/O parallelism without
> multiple processes/connections?

Which kind of I/O parallelism are you thinking of? Independent
tablespaces? Or devices that can handle multiple in-flight IOs? WRT the
latter, at least linux will keep many IOs in-flight for sequential
buffered reads.


> - UNIX socket was slower than a local TCP socket, and about the same
> speed as a TCP socket with SSL.

Hm. Interesting. Wonder if that's a question of the unix socket buffer
size?

> - CRC-32C is about 10% slower than no manifest and/or no checksums in
> the manifest. SHA256 is 1.5-2x slower, but less when compression is
> also used (see below).
> - Plain format is a little slower than tar format; tar with gzip is
> typically >~5x slower, but less when the checksum algorithm is SHA256
> (again, see below).

I see about 250MB/s with -Z1 (from the source side). If I hack
pg_basebackup.c to specify a deflate level of 0 to gzsetparams, which
the zlib docs say should disable compression, I get up to 700MB/s.
That is still a factor of ~3.7 slower than uncompressed.

This seems largely due to zlib's crc32 computation not being hardware
accelerated:
-   99.75%     0.05%  pg_basebackup  pg_basebackup       [.] BaseBackup
   - 99.95% BaseBackup
      - 81.60% writeTarData
         - gzwrite
         - gz_write
            - gz_comp.constprop.0
               - 85.11% deflate
                  - 97.66% deflate_stored
                     + 87.45% crc32_z
                     + 9.53% __memmove_avx_unaligned_erms
                     + 3.02% _tr_stored_block
                    2.27% __memmove_avx_unaligned_erms
               + 14.86% __libc_write
      + 18.40% pqGetCopyData3
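
The hack mentioned above amounts to something like this, against
zlib's gzip file API (sketch only; pg_basebackup's actual code
differs):

#include <zlib.h>

/* Force "stored" (uncompressed) deflate blocks on an output stream.
 * zlib still computes a CRC32 of all input, which is what shows up
 * as crc32_z in the profile above. */
static gzFile
open_uncompressed_gz(int fd)
{
    gzFile      gz = gzdopen(fd, "wb");

    if (gz != NULL)
        gzsetparams(gz, Z_NO_COMPRESSION, Z_DEFAULT_STRATEGY);
    return gz;
}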



> It seems to me that the interesting cases may involve having lots of
> available CPUs and lots of disk spindles, but a comparatively slow
> pipe between the machines.

Hm, I'm not sure I am following. If network is the bottleneck, we'd
immediately fill the buffers, and that'd be that?

ISTM all of this is only really relevant if either pg_basebackup or
walsender is the bottleneck?


> I mean, if it takes 36 hours to read the
> data from disk, you can't realistically expect to complete a full
> backup in less than 36 hours. Incremental backup might help, but
> otherwise you're just dead. On the other hand, if you can read the
> data from the disk in 2 hours but it takes 36 hours to complete a
> backup, it seems like you have more justification for thinking that
> the backup software could perhaps do better. In such cases efficient
> server-side compression may help a lot, but even then, I wonder
> whether you can read the data at maximum speed with only a single
> process? I tend to doubt it, but I guess you only have to be fast
> enough to saturate the network. Hmm.

Well, I can do >8GByte/s of buffered reads in a single process
(obviously cached, because I don't have storage quite that fast -
uncached I can read at nearly 3GByte/s, the disk's speed). So sure,
there's a limit to what a single process can do, but I think we're
fairly far away from it.

I think it's fairly obvious that we need faster compression - and that
while we clearly can win a lot by just using a faster
algorithm/implementation than standard zlib, we'll likely also need
parallelism in some form.  I'm doubtful that using multiple connections
and multiple backends is the best way to achieve that, but it'd be a
way.

Greetings,

Andres Freund



Re: design for parallel backup

From: Robert Haas

On Tue, Apr 21, 2020 at 4:14 PM Andres Freund <andres@anarazel.de> wrote:
> It was local TCP. The speeds I can reach are faster than the 10Gbit/s
> (unidirectional) I can do between the laptop & workstation, so testing
> it over "actual" network isn't informative - I basically can reach line
> speed between them with any method.

Is that really a conclusive test, though? In the case of either local
TCP or a fast local interconnect, you'll have negligible latency. It
seems at least possible that saturating the available bandwidth is
harder on a higher-latency connection. Cross-region data center
connections figure to have way higher latency than a local wired
network, let alone the loopback interface.

> It was in kernel buffer cache. But I can reach 100% utilization of
> storage too (which is slightly slower than what I can do over unix
> socket).
>
> pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
> 2.59GiB/s
> find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
> 2.53GiB/s
> find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
> 2.42GiB/s
>
> I tested this with a -s 5000 DB, FWIW.

But that's not a real test either, because you're not writing the data
anywhere. It's going to be a whole lot easier to saturate the read
side if the write side is always zero latency.

> > How do you expect to take advantage of I/O parallelism without
> > multiple processes/connections?
>
> Which kind of I/O parallelism are you thinking of? Independent
> tablespaces? Or devices that can handle multiple in-flight IOs? WRT the
> latter, at least linux will keep many IOs in-flight for sequential
> buffered reads.

Both. I know that the kernel will prefetch for sequential reads, but
it won't know what file you're going to access next, so I think you'll
tend to stall when you reach the end of each file. It also seems
possible that on a large disk array, you could read N files at a time
with greater aggregate bandwidth than you can read a single file.

> > It seems to me that the interesting cases may involve having lots of
> > available CPUs and lots of disk spindles, but a comparatively slow
> > pipe between the machines.
>
> Hm, I'm not sure I am following. If network is the bottleneck, we'd
> immediately fill the buffers, and that'd be that?
>
> ISTM all of this is only really relevant if either pg_basebackup or
> walsender is the bottleneck?

I agree that if neither pg_basebackup nor walsender is the bottleneck,
parallelism is unlikely to be very effective. I have realized as a
result of your comments that I actually don't care intrinsically about
parallel backup; what I actually care about is making backups very,
very fast. I suspect that parallelism is a useful means to that end,
but I interpret your comments as questioning that, and specifically
drawing attention to the question of where the bottlenecks might be.
So I'm trying to think about that.

> I think it's fairly obvious that we need faster compression - and that
> while we clearly can win a lot by just using a faster
> algorithm/implementation than standard zlib, we'll likely also need
> parallelism in some form.  I'm doubtful that using multiple connections
> and multiple backends is the best way to achieve that, but it'd be a
> way.

I think it has a good chance of being pretty effective, but it's
certainly worth casting about for other possibilities that might
deliver more benefit or be less work. In terms of better compression,
I did a little looking around and it seems like LZ4 is generally
agreed to be a lot faster than gzip, and also significantly faster
than most other things that one might choose to use. On the other
hand, the compression ratio may not be as good; e.g.
https://facebook.github.io/zstd/ cites a 2.1 ratio (on some data set)
for lz4 and a 2.9 ratio for zstd. While the compression and
decompression speeds are slower, they are close enough that you might
be able to make up the difference by using 2x the cores for
compression and 3x for decompression. I don't know if that sort of
thing is worth considering. If your limitation is the line speed, and
you have CPU cores to burn, a significantly higher compression
ratio means significantly faster backups. On the other hand, if you're
backing up over the LAN and the machine is heavily taxed, that's
probably not an appealing trade.
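
For reference, a one-shot compression call against the LZ4 frame API
is about this much code (sketch only, error handling trimmed):

#include <lz4frame.h>
#include <stdlib.h>

/* Compress a buffer into a standard .lz4 frame; returns bytes written,
 * or 0 on error. The caller frees *dst. */
static size_t
lz4_compress_buffer(const void *src, size_t srclen, void **dst)
{
    size_t      bound = LZ4F_compressFrameBound(srclen, NULL);
    size_t      n;

    *dst = malloc(bound);
    if (*dst == NULL)
        return 0;
    n = LZ4F_compressFrame(*dst, bound, src, srclen, NULL);
    return LZ4F_isError(n) ? 0 : n;
}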

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From: Andres Freund

Hi,

On 2020-04-21 17:09:50 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 4:14 PM Andres Freund <andres@anarazel.de> wrote:
> > It was local TCP. The speeds I can reach are faster than the 10Gbit/s
> > (unidirectional) I can do between the laptop & workstation, so testing
> > it over "actual" network isn't informative - I basically can reach line
> > speed between them with any method.
>
> Is that really a conclusive test, though? In the case of either local
> TCP or a fast local interconnect, you'll have negligible latency. It
> seems at least possible that saturating the available bandwidth is
> harder on a higher-latency connection. Cross-region data center
> connections figure to have way higher latency than a local wired
> network, let alone the loopback interface.

Sure. But that's what the TCP window etc should take care of. You might
have to tune the OS if you have a high latency multi-GBit link, but
you'd have to do that regardless of whether a single process or multiple
processes are used.  And the number of people with high-latency
multi-gbit links isn't that high, compared to the number taking backups
within a datacenter.


> > It was in kernel buffer cache. But I can reach 100% utilization of
> > storage too (which is slightly slower than what I can do over unix
> > socket).
> >
> > pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
> > 2.59GiB/s
> > find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
> > 2.53GiB/s
> > find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
> > 2.42GiB/s
> >
> > I tested this with a -s 5000 DB, FWIW.
>
> But that's not a real test either, because you're not writing the data
> anywhere. It's going to be a whole lot easier to saturate the read
> side if the write side is always zero latency.

I also stored data elsewhere in separate threads. But the bottleneck of
that is lower (my storage is faster on reads than on writes, at least
after the ram on the nvme is exhausted)...


> > > It seems to me that the interesting cases may involve having lots of
> > > available CPUs and lots of disk spindles, but a comparatively slow
> > > pipe between the machines.
> >
> > Hm, I'm not sure I am following. If network is the bottleneck, we'd
> > immediately fill the buffers, and that'd be that?
> >
> > ISTM all of this is only really relevant if either pg_basebackup or
> > walsender is the bottleneck?
>
> I agree that if neither pg_basebackup nor walsender is the bottleneck,
> parallelism is unlikely to be very effective. I have realized as a
> result of your comments that I actually don't care intrinsically about
> parallel backup; what I actually care about is making backups very,
> very fast. I suspect that parallelism is a useful means to that end,
> but I interpret your comments as questioning that, and specifically
> drawing attention to the question of where the bottlenecks might be.
> So I'm trying to think about that.

I agree that trying to make backups very fast is a good goal (or well, I
think not very slow would be a good descriptor for the current
situation). I am just trying to make sure we tackle the right problems
for that. My gut feeling is that we have to tackle compression first,
because without addressing that "all hope is lost" ;)

FWIW, here's the base backup from pgbench -i -s 5000 compressed a number
of ways. The uncompressed backup is 64622701911 bytes. Unfortunately
pgbench -i -s 5000 is not a particularly good example, it's just too
compressible.


method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size        rate    format
gzip    1       1               380.79          368.46          12.15           3892457816      16.6    .gz
gzip    6       1               976.05          963.10          12.84           3594605389      18.0    .gz
pigz    1       10              34.35           364.14          23.55           3892401867      16.6    .gz
pigz    6       10              101.27          1056.85         28.98           3620724251      17.8    .gz
zstd-gz 1       1               278.14          265.31          12.81           3897174342      15.6    .gz
zstd-gz 1       6               906.67          893.58          12.52           3598238594      18.0    .gz
zstd    1       1               82.95           67.97           11.82           2853193736      22.6    .zstd
zstd    6       1               228.58          214.65          13.92           2687177334      24.0    .zstd
zstd    1       10              25.05           151.84          13.35           2847414913      22.7    .zstd
zstd    6       10              43.47           374.30          12.37           2745211100      23.5    .zstd
zstd    6       20              32.50           468.18          13.44           2745211100      23.5    .zstd
zstd    9       20              57.99           949.91          14.13           2606535138      24.8    .zstd
lz4     1       1               49.94           36.60           13.33           7318668265      8.8     .lz4
lz4     3       1               201.79          187.36          14.42           6561686116      9.84    .lz4
lz4     6       1               318.35          304.64          13.55           6560274369      9.9     .lz4
pixz    1       10              92.54           925.52          37.00           1199499772      53.8    .xz
pixz    3       10              210.77          2090.38         37.96           1186219752      54.5    .xz
bzip2   1       1               2210.04         2190.89         17.67           1276905211      50.6    .bz2
pbzip2  1       10              236.03          2352.09         34.01           1332010572      48.5    .bz2
plzip   1       10              243.08          2430.18         25.60           915598323       70.6    .lz
plzip   3       10              359.04          3577.94         27.92           1018585193      63.4    .lz
plzip   3       20              197.36          3911.85         22.02           1018585193      63.4    .lz

(zstd-gz is zstd with --format=gzip; zstd with parallelism 1 is with
--single-thread to avoid the separate IO thread it uses by default, even
with -T0)

These weren't taken on a completely quiesced system, and I tested gzip
and bzip2 in parallel, because they took so long. But I think this still
gives a good overview (cpu-user-time also isn't that affected by small
amounts of noise).

It looks to me that bzip2/pbzip2 are clearly too slow. pixz looks
interesting as it achieves pretty good compression rates at a lower cost
than plzip. plzip's rates are impressive, but damn, is it expensive. And
a higher compression level producing a *larger* file is also a bit "huh"?


Does anybody have a better idea what exactly to use as a good test
corpus? pgbench -i clearly sucks, but ...


One thing this reminded me of is whether using a format (tar) that
doesn't allow efficient addressing of individual files is a good idea
for base backups. The compression rates very likely will be better when
not compressing tiny files individually, but at the same time it'd be
very useful to be able to access individual files more efficiently than
O(N). I can imagine that being important for some cases of incremental
backup assembly.
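
To illustrate the O(N) point, a minimal sketch (find_member() is made
up; long-name extensions and error handling are ignored): even with a
seekable file, you have to walk every ustar header ahead of the one you
want.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAR_BLOCK 512

/* Walk header-by-header; the cost grows with the number of preceding
 * members, even though each member's data is skipped with a seek. */
static long
find_member(FILE *f, const char *want)
{
    char        hdr[TAR_BLOCK];

    while (fread(hdr, 1, TAR_BLOCK, f) == TAR_BLOCK && hdr[0] != '\0')
    {
        long        size = strtol(hdr + 124, NULL, 8); /* octal size field */
        long        padded = (size + TAR_BLOCK - 1) / TAR_BLOCK * TAR_BLOCK;

        if (strncmp(hdr, want, 100) == 0)
            return ftell(f);            /* member data starts here */
        if (fseek(f, padded, SEEK_CUR) != 0)
            break;                      /* non-seekable input: stuck with O(bytes) */
    }
    return -1;
}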


> > I think it's fairly obvious that we need faster compression - and that
> > while we clearly can win a lot by just using a faster
> > algorithm/implementation than standard zlib, we'll likely also need
> > parallelism in some form.  I'm doubtful that using multiple connections
> > and multiple backends is the best way to achieve that, but it'd be a
> > way.
>
> I think it has a good chance of being pretty effective, but it's
> certainly worth casting about for other possibilities that might
> deliver more benefit or be less work. In terms of better compression,
> I did a little looking around and it seems like LZ4 is generally
> agreed to be a lot faster than gzip, and also significantly faster
> than most other things that one might choose to use. On the other
> hand, the compression ratio may not be as good; e.g.
> https://facebook.github.io/zstd/ cites a 2.1 ratio (on some data set)
> for lz4 and a 2.9 ratio for zstd. While the compression and
> decompression speeds are slower, they are close enough that you might
> be able to make up the difference by using 2x the cores for
> compression and 3x for decompression. I don't know if that sort of
> thing is worth considering. If your limitation is the line speed, and
> you have CPU cores to burn, a significantly higher compression
> ratio means significantly faster backups. On the other hand, if you're
> backing up over the LAN and the machine is heavily taxed, that's
> probably not an appealing trade.

I think zstd with a low compression "setting" would be a pretty good
default for most cases. lz4 is considerably faster, true, but the
compression rates are also considerably worse. I think lz4 is great for
mostly in-memory workloads (e.g. a compressed cache / live database with
compressed data, as it allows getting reasonably close to memory speeds
but with twice the data), but for anything longer lived zstd is probably
better.

The other big benefit is that zstd's library has multi-threaded
compression built in, whereas that's not the case for other libraries
that I am aware of.
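
For reference, a minimal sketch of what using that looks like (zstd >=
1.4 built with multithreading; buffer sizes and the surrounding I/O are
illustrative):

#include <stdio.h>
#include <zstd.h>

/* Compress one chunk; with nbWorkers > 0 libzstd spreads the work over
 * its own internal threads while the caller stays single-threaded. */
static void
compress_chunk(ZSTD_CCtx *cctx, const void *buf, size_t len,
               FILE *out, int is_last)
{
    ZSTD_inBuffer in = {buf, len, 0};
    char        outbuf[1 << 17];
    size_t      remaining;

    do
    {
        ZSTD_outBuffer o = {outbuf, sizeof(outbuf), 0};

        remaining = ZSTD_compressStream2(cctx, &o, &in,
                                         is_last ? ZSTD_e_end : ZSTD_e_continue);
        fwrite(outbuf, 1, o.pos, out);
    } while (is_last ? remaining != 0 : in.pos < in.size);
}

int
main(void)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 10);
    /* ... feed the base backup stream through compress_chunk() ... */
    ZSTD_freeCCtx(cctx);
    return 0;
}

The worker threads live entirely inside libzstd; the caller stays a
single process.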

Greetings,

Andres Freund



Re: design for parallel backup

From
Robert Haas
Date:
On Tue, Apr 21, 2020 at 6:57 PM Andres Freund <andres@anarazel.de> wrote:
> I agree that trying to make backups very fast is a good goal (or well, I
> think not very slow would be a good descriptor for the current
> situation). I am just trying to make sure we tackle the right problems
> for that. My gut feeling is that we have to tackle compression first,
> because without addressing that "all hope is lost" ;)

OK. I have no objection to the idea of starting with (1) server side
compression and (2) a better compression algorithm. However, I'm not
very sold on the idea of relying on parallelism that is specific to
compression. I think that parallelism across the whole operation -
multiple connections, multiple processes, etc. - may be a more
promising approach than trying to parallelize specific stages of the
process. I am not sure about that; it could be wrong, and I'm open to
the possibility that it is, in fact, wrong.

Leaving out all the three and four digit wall times from your table:

> method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
> pigz    1       10              34.35           364.14          23.55           3892401867      16.6    .gz
> zstd    1       1               82.95           67.97           11.82           2853193736      22.6    .zstd
> zstd    1       10              25.05           151.84          13.35           2847414913      22.7    .zstd
> zstd    6       10              43.47           374.30          12.37           2745211100      23.5    .zstd
> zstd    6       20              32.50           468.18          13.44           2745211100      23.5    .zstd
> zstd    9       20              57.99           949.91          14.13           2606535138      24.8    .zstd
> lz4     1       1               49.94           36.60           13.33           7318668265      8.8     .lz4
> pixz    1       10              92.54           925.52          37.00           1199499772      53.8    .xz

It's notable that almost all of the fast wall times here are with
zstd; the surviving entries with pigz and pixz are with ten-way
parallelism, and both pigz and lz4 have worse compression ratios than
zstd. My impression, though, is that LZ4 might be getting a bit of a
raw deal here because of the repetitive nature of the data. I theorize
based on some reading I did yesterday, and general hand-waving, that
maybe the compression ratios would be closer together on a more
realistic data set. It's also notable that lz4 -1 is BY FAR the winner
in terms of absolute CPU consumption. So I kinda wonder whether
supporting both LZ4 and ZSTD might be the way to go, especially since
once we have the LZ4 code we might be able to use it for other things,
too.

> One thing this reminded me of is whether using a format (tar) that
> doesn't allow efficient addressing of individual files is a good idea
> for base backups. The compression rates very likely will be better when
> not compressing tiny files individually, but at the same time it'd be
> very useful to be able to access individual files more efficiently than
> O(N). I can imagine that being important for some cases of incremental
> backup assembly.

Yeah, being able to operate directly on the compressed version of the
file would be very useful, but I'm not sure that we have great options
available there. I think the only widely-used format that supports
that is ".zip", and I'm not too sure about emitting zip files.
Apparently, pixz also supports random access to archive members, and
it did have one entry that survived my arbitrary cut in the table
above, but the last release was in 2015, and it seems to be only a
command-line tool, not a library. It also depends on libarchive and
liblzma, which is not awful, but I'm not sure we want to suck in that
many dependencies. But that's really a secondary thing: I can't
imagine us depending on something that hasn't had a release in 5
years, and has less than 300 total commits.

Now, it is based on xz/liblzma, and those seem to have some built-in
indexing capabilities which it may be leveraging, so possibly we could
roll our own. I'm not too sure about that, though, and it would limit
us to using only that form of compression.

Other options include, perhaps, (1) emitting a tarfile of compressed
files instead of a compressed tarfile, and (2) writing our own index
files. We don't know when we begin emitting the tarfile what files
we're going to find or how big they will be, so we can't really emit
a directory at the beginning of the file. Even if we thought we knew,
files can disappear or be truncated before we get around to archiving
them. However, when we reach the end of the file, we do know what we
included and how big it was, so possibly we could generate an index
for each tar file, or include something in the backup manifest.
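
To sketch what I mean (nothing like this exists today; the struct and
its fields are invented), the per-tarfile index might be little more
than an array of entries like this, written once the archive is
complete, either appended to the tar file or carried in the backup
manifest:

#include <stdint.h>

typedef struct BackupIndexEntry
{
    char        path[1024];     /* member name, e.g. "base/16384/2619" */
    uint64_t    offset;         /* offset of the member's tar header */
    uint64_t    length;         /* size of the member's data */
} BackupIndexEntry;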

> The other big benefit is that zstd's library has multi-threaded
> compression built in, whereas that's not the case for other libraries
> that I am aware of.

Wouldn't it be a problem to let the backend become multi-threaded, at
least on Windows?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-04-22 09:52:53 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 6:57 PM Andres Freund <andres@anarazel.de> wrote:
> > I agree that trying to make backups very fast is a good goal (or well, I
> > think not very slow would be a good descriptor for the current
> > situation). I am just trying to make sure we tackle the right problems
> > for that. My gut feeling is that we have to tackle compression first,
> > because without addressing that "all hope is lost" ;)
> 
> OK. I have no objection to the idea of starting with (1) server side
> compression and (2) a better compression algorithm. However, I'm not
> very sold on the idea of relying on parallelism that is specific to
> compression. I think that parallelism across the whole operation -
> multiple connections, multiple processes, etc. - may be a more
> promising approach than trying to parallelize specific stages of the
> process. I am not sure about that; it could be wrong, and I'm open to
> the possibility that it is, in fact, wrong.

*My* gut feeling is that you're going to have a harder time using CPU
time efficiently when doing parallel compression via multiple processes
and independent connections. You're e.g. going to have a lot more
context switches, I think. And there will be network overhead from doing
more connections (including worse congestion control).


> Leaving out all the three and four digit wall times from your table:
> 
> > method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
> > pigz    1       10              34.35           364.14          23.55           3892401867      16.6    .gz
> > zstd    1       1               82.95           67.97           11.82           2853193736      22.6    .zstd
> > zstd    1       10              25.05           151.84          13.35           2847414913      22.7    .zstd
> > zstd    6       10              43.47           374.30          12.37           2745211100      23.5    .zstd
> > zstd    6       20              32.50           468.18          13.44           2745211100      23.5    .zstd
> > zstd    9       20              57.99           949.91          14.13           2606535138      24.8    .zstd
> > lz4     1       1               49.94           36.60           13.33           7318668265      8.8     .lz4
> > pixz    1       10              92.54           925.52          37.00           1199499772      53.8    .xz
> 
> It's notable that almost all of the fast wall times here are with
> zstd; the surviving entries with pigz and pixz are with ten-way
> parallelism, and both pigz and lz4 have worse compression ratios than
> zstd. My impression, though, is that LZ4 might be getting a bit of a
> raw deal here because of the repetitive nature of the data. I theorize
> based on some reading I did yesterday, and general hand-waving, that
> maybe the compression ratios would be closer together on a more
> realistic data set.

I agree that most datasets won't get even close to what we've seen
here. And that disadvantages e.g. lz4.

To come up with a much less compressible case, I generated data the
following way:

CREATE TABLE random_data(id serial NOT NULL, r1 float not null, r2 float not null, r3 float not null);
ALTER TABLE random_data SET (FILLFACTOR = 100);
ALTER SEQUENCE random_data_id_seq CACHE 1024;
-- with pgbench, I ran this in parallel for 100s
INSERT INTO random_data(r1,r2,r3) SELECT random(), random(), random() FROM generate_series(1, 100000);
-- then created indexes, using a high fillfactor to ensure few zeroed out parts
ALTER TABLE random_data ADD CONSTRAINT random_data_id_pkey PRIMARY KEY(id) WITH (FILLFACTOR = 100);
CREATE INDEX random_data_r1 ON random_data(r1) WITH (fillfactor = 100);

this results in a 16GB base backup. I think this is probably a good bit
less compressible than most PG databases.


method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
gzip    1       1               305.37          299.72          5.52            7067232465      2.28
lz4     1       1               33.26           27.26           5.99            8961063439      1.80     .lz4
lz4     3       1               188.50          182.91          5.58            8204501460      1.97     .lz4
zstd    1       1               66.41           58.38           6.04            6925634128      2.33     .zstd
zstd    1       10              9.64            67.04           4.82            6980075316      2.31     .zstd
zstd    3       1               122.04          115.79          6.24            6440274143      2.50     .zstd
zstd    3       10              13.65           106.11          5.64            6438439095      2.51     .zstd
zstd    9       10              100.06          955.63          6.79            5963827497      2.71     .zstd
zstd    15      10              259.84          2491.39         8.88            5912617243      2.73     .zstd
pixz    1       10              162.59          1626.61         15.52           5350138420      3.02     .xz
plzip   1       20              135.54          2705.28         9.25            5270033640      3.06     .lz


> It's also notable that lz4 -1 is BY FAR the winner in terms of
> absolute CPU consumption. So I kinda wonder whether supporting both
> LZ4 and ZSTD might be the way to go, especially since once we have the
> LZ4 code we might be able to use it for other things, too.

Yea. I think the case for lz4 is far stronger in other
places. E.g. having lz4 -1 for toast can make a lot of sense, suddenly
repeated detoasting is much less of an issue, while still achieving
higher compression than pglz.

.oO(Now I really see how pglz compares to the above)


> > One thing this reminded me of is whether using a format (tar) that
> > doesn't allow efficient addressing of individual files is a good idea
> > for base backups. The compression rates very likely will be better when
> > not compressing tiny files individually, but at the same time it'd be
> > very useful to be able to access individual files more efficiently than
> > O(N). I can imagine that being important for some cases of incremental
> > backup assembly.
> 
> Yeah, being able to operate directly on the compressed version of the
> file would be very useful, but I'm not sure that we have great options
> available there. I think the only widely-used format that supports
> that is ".zip", and I'm not too sure about emitting zip files.

I don't really see a problem with emitting .zip files. It's an extremely
widely used container format for all sorts of file formats these days.
Except for needing a bit more complicated (and I don't think it's *that*
big of a difference) code during generation / unpacking, it seems
clearly advantageous over .tar.gz etc.


> Apparently, pixz also supports random access to archive members, and
> it did have one entry that survived my arbitrary cut in the table
> above, but the last release was in 2015, and it seems to be only a
> command-line tool, not a library. It also depends on libarchive and
> liblzma, which is not awful, but I'm not sure we want to suck in that
> many dependencies. But that's really a secondary thing: I can't
> imagine us depending on something that hasn't had a release in 5
> years, and has less than 300 total commits.

Oh, yea. I just looked at the various tools I could find that did
parallel compression.


> Other options include, perhaps, (1) emitting a tarfile of compressed
> files instead of a compressed tarfile

Yea, that'd help some. Although I am not sure how good the tooling to
seek through tarfiles in an O(files) rather than O(bytes) manner is.

I think there are some cases where using separate compression state for each
file would hurt us. Some of the archive formats have support for reusing
compression state, but I don't know which.


> , and (2) writing our own index files. We don't know when we begin
> emitting the tarfile what files we're going to find or how big they
> will be, so we can't really emit a directory at the beginning of the
> file. Even if we thought we knew, files can disappear or be truncated
> before we get around to archiving them. However, when we reach the end
> of the file, we do know what we included and how big it was, so
> possibly we could generate an index for each tar file, or include
> something in the backup manifest.

Hm. There's some appeal to just store offsets in the manifest, and to
make sure it's a seekable offset in the compression stream. OTOH, it
makes it pretty hard for other tools to generate a compatible archive.


> > The other big benefit is that zstd's library has multi-threaded
> > compression built in, whereas that's not the case for other libraries
> > that I am aware of.
> 
> Wouldn't it be a problem to let the backend become multi-threaded, at
> least on Windows?

We already have threads in windows, e.g. the signal handler emulation
stuff runs in one. Are you thinking of this bit in postmaster.c:

#ifdef HAVE_PTHREAD_IS_THREADED_NP

    /*
     * On macOS, libintl replaces setlocale() with a version that calls
     * CFLocaleCopyCurrent() when its second argument is "" and every relevant
     * environment variable is unset or empty.  CFLocaleCopyCurrent() makes
     * the process multithreaded.  The postmaster calls sigprocmask() and
     * calls fork() without an immediate exec(), both of which have undefined
     * behavior in a multithreaded program.  A multithreaded postmaster is the
     * normal case on Windows, which offers neither fork() nor sigprocmask().
     */
    if (pthread_is_threaded_np() != 0)
        ereport(FATAL,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("postmaster became multithreaded during startup"),
                 errhint("Set the LC_ALL environment variable to a valid locale.")));
#endif

?

I don't really see any of the concerns there applying to the base
backup case.

Greetings,

Andres Freund



Re: design for parallel backup

From
Robert Haas
Date:
On Wed, Apr 22, 2020 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
> *My* gut feeling is that you're going to have a harder time using CPU
> time efficiently when doing parallel compression via multiple processes
> and independent connections. You're e.g. going to have a lot more
> context switches, I think. And there will be network overhead from doing
> more connections (including worse congestion control).

OK, noted. I'm still doubtful that the optimal number of connections
is 1, but it might be that the optimal number of CPU cores to apply to
compression is much higher than the optimal number of connections. For
instance, suppose there are two equally sized tablespaces on separate
drives, but zstd with 10-way parallelism is our chosen compression
strategy. It seems to me that two connections has an excellent chance
of being faster than one, because with only one connection I don't see
how you can benefit from the opportunity to do I/O in parallel.
However, I can also see that having twenty connections just as a way
to get 10-way parallelism for each tablespace might be undesirable
and/or inefficient for various reasons.

> this results in a 16GB base backup. I think this is probably a good bit
> less compressible than most PG databases.
>
> method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
> gzip    1       1               305.37          299.72          5.52            7067232465      2.28
> lz4     1       1               33.26           27.26           5.99            8961063439      1.80     .lz4
> lz4     3       1               188.50          182.91          5.58            8204501460      1.97     .lz4
> zstd    1       1               66.41           58.38           6.04            6925634128      2.33     .zstd
> zstd    1       10              9.64            67.04           4.82            6980075316      2.31     .zstd
> zstd    3       1               122.04          115.79          6.24            6440274143      2.50     .zstd
> zstd    3       10              13.65           106.11          5.64            6438439095      2.51     .zstd
> zstd    9       10              100.06          955.63          6.79            5963827497      2.71     .zstd
> zstd    15      10              259.84          2491.39         8.88            5912617243      2.73     .zstd
> pixz    1       10              162.59          1626.61         15.52           5350138420      3.02     .xz
> plzip   1       20              135.54          2705.28         9.25            5270033640      3.06     .lz

So, picking a better compressor in this case looks a lot less
exciting. Parallel zstd still compresses somewhat better than
single-core lz4, but the difference in compression ratio is far less,
and the amount of CPU you have to burn in order to get that extra
compression is pretty large.

> I don't really see a problem with emitting .zip files. It's an extremely
> widely used container format for all sorts of file formats these days.
> Except for needing a bit more complicated (and I don't think it's *that*
> big of a difference) code during generation / unpacking, it seems
> clearly advantageous over .tar.gz etc.

Wouldn't that imply buying into DEFLATE as our preferred compression algorithm?

Either way, I don't really like the idea of having PostgreSQL have its
own code to generate and interpret various archive formats. That seems
like a maintenance nightmare and a recipe for bugs. How can anyone
even verify that our existing 'tar' code works with all 'tar'
implementations out there, or that it's correct in all cases? Do we
really want to maintain similar code for other formats, or even for
this one? I'd say "no". We should pick archive formats that have good,
well-maintained libraries with permissive licenses and then use those.
I don't know whether "zip" falls into that category or not.

> > Other options include, perhaps, (1) emitting a tarfile of compressed
> > files instead of a compressed tarfile
>
> Yea, that'd help some. Although I am not sure how good the tooling to
> seek through tarfiles in an O(files) rather than O(bytes) manner is.

Well, considering that at present we're using hand-rolled code...

> I think there are some cases where using separate compression state for each
> file would hurt us. Some of the archive formats have support for reusing
> compression state, but I don't know which.

Yeah, I had the same thought. People with mostly 1GB relation segments
might not notice much difference, but people with lots of little
relations might see a more significant difference.

> Hm. There's some appeal to just store offsets in the manifest, and to
> make sure it's a seekable offset in the compression stream. OTOH, it
> makes it pretty hard for other tools to generate a compatible archive.

Yeah.

FWIW, I don't see it as being entirely necessary to create a seekable
compressed archive format, let alone to make all of our compressed
archive formats seekable. I think supporting multiple compression
algorithms in a flexible way that's not too tied to the capabilities
of particular algorithms is more important. If you want fast restores
of incremental and differential backups, consider using -Fp rather
than -Ft. Or we can have a new option that's like -Fp but every file
is compressed individually in place, or files larger than N bytes are
compressed in place using a configurable algorithm. It might be
somewhat less efficient but it's also way less complicated to
implement, and I think that should count for something. I don't want
to get so caught up in advanced features here that we don't make any
useful progress at all. If we can add better features without a large
complexity increment, and without drawing objections from others on
this list, great. If not, I'm prepared to summarily jettison it as
nice-to-have but not essential.

> I don't really see any of the concerns there applying to the base
> backup case.

I felt like there was some reason that threads were bad, but it may
have just been the case you mentioned and not relevant here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Peter Eisentraut
Date:
On 2020-04-20 22:36, Robert Haas wrote:
> My suspicion is that it has mostly to do with adequately utilizing the
> hardware resources on the server side. If you are network-constrained,
> adding more connections won't help, unless there's something shaping
> the traffic which can be gamed by having multiple connections.

This is a thing.  See "long fat network" and "bandwidth-delay product" 
(https://en.wikipedia.org/wiki/Bandwidth-delay_product).  The proper way 
to address this is presumably with TCP parameter tuning, but in practice 
it's often easier to just start multiple connections, for example, when 
doing a backup via rsync.
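
To put illustrative numbers on that: at 10 Gbit/s with a 50 ms
round-trip time, the bandwidth-delay product is 10 Gbit/s * 0.05 s =
500 Mbit, i.e. roughly 62 MB.  A single TCP connection needs a window
about that large to keep the pipe full; with typical defaults far
smaller, N connections get N windows, which is why multiple streams
help in practice.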

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: design for parallel backup

From
Robert Haas
Date:
On Wed, Apr 22, 2020 at 12:20 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2020-04-20 22:36, Robert Haas wrote:
> > My suspicion is that it has mostly to do with adequately utilizing the
> > hardware resources on the server side. If you are network-constrained,
> > adding more connections won't help, unless there's something shaping
> > the traffic which can be gamed by having multiple connections.
>
> This is a thing.  See "long fat network" and "bandwidth-delay product"
> (https://en.wikipedia.org/wiki/Bandwidth-delay_product).  The proper way
> to address this is presumably with TCP parameter tuning, but in practice
> it's often easier to just start multiple connections, for example, when
> doing a backup via rsync.

Very interesting -- thanks!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-04-22 12:12:32 -0400, Robert Haas wrote:
> On Wed, Apr 22, 2020 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
> > *My* gut feeling is that you're going to have a harder time using CPU
> > time efficiently when doing parallel compression via multiple processes
> > and independent connections. You're e.g. going to have a lot more
> > context switches, I think. And there will be network overhead from doing
> > more connections (including worse congestion control).
> 
> OK, noted. I'm still doubtful that the optimal number of connections
> is 1, but it might be that the optimal number of CPU cores to apply to
> compression is much higher than the optimal number of connections.

Yea, that's basically what I think too.


> For instance, suppose there are two equally sized tablespaces on
> separate drives, but zstd with 10-way parallelism is our chosen
> compression strategy. It seems to me that two connections has an
> excellent chance of being faster than one, because with only one
> connection I don't see how you can benefit from the opportunity to do
> I/O in parallel.

Yea. That's exactly the case for "connection level" parallelism I had
upthread as well. It'd require being somewhat careful about different
tablespaces in the selection for each connection, but that's not that
hard.

I also can see a case for using N backends and one connection, but I
think that'll be too complicated / too much bound by locking around the
socket etc.


> 
> > this results in a 16GB base backup. I think this is probably a good bit
> > less compressible than most PG databases.
> >
> > method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
> > gzip    1       1               305.37          299.72          5.52            7067232465      2.28
> > lz4     1       1               33.26           27.26           5.99            8961063439      1.80     .lz4
> > lz4     3       1               188.50          182.91          5.58            8204501460      1.97     .lz4
> > zstd    1       1               66.41           58.38           6.04            6925634128      2.33     .zstd
> > zstd    1       10              9.64            67.04           4.82            6980075316      2.31     .zstd
> > zstd    3       1               122.04          115.79          6.24            6440274143      2.50     .zstd
> > zstd    3       10              13.65           106.11          5.64            6438439095      2.51     .zstd
> > zstd    9       10              100.06          955.63          6.79            5963827497      2.71     .zstd
> > zstd    15      10              259.84          2491.39         8.88            5912617243      2.73     .zstd
> > pixz    1       10              162.59          1626.61         15.52           5350138420      3.02     .xz
> > plzip   1       20              135.54          2705.28         9.25            5270033640      3.06     .lz
> 
> So, picking a better compressor in this case looks a lot less
> exciting.

Oh? I find it *extremely* exciting here. This is pretty close to the
worst case compressability-wise, and zstd takes only ~22% of the time as
gzip does, while still delivering better compression.  A nearly 5x
improvement in compression times seems pretty exciting to me.

Or do you mean for zstd over lz4, rather than anything over gzip?  1.8x
-> 2.3x is a pretty decent improvement still, no? And being able to do
it in 1/3 of the wall time seems pretty helpful.

> Parallel zstd still compresses somewhat better than single-core lz4,
> but the difference in compression ratio is far less, and the amount of
> CPU you have to burn in order to get that extra compression is pretty
> large.

It's "just" a ~2x difference for "level 1" compression, right? For
having 1.9GiB less to write / permanently store of a 16GiB base
backup, that doesn't seem that bad to me.


> > I don't really see a problem with emitting .zip files. It's an extremely
> > widely used container format for all sorts of file formats these days.
> > Except for needing a bit more complicated (and I don't think it's *that*
> > big of a difference) code during generation / unpacking, it seems
> > clearly advantageous over .tar.gz etc.
> 
> Wouldn't that imply buying into DEFLATE as our preferred compression algorithm?

zip doesn't have to imply DEFLATE although it is the most common
option. There's a compression method associated with each file.
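
For reference, a sketch of the fixed part of a zip local file header
per the PKWARE APPNOTE (the struct name is mine).  The classic method
values are 0 (stored) and 8 (DEFLATE); later APPNOTE revisions assign
IDs for bzip2 and LZMA, and more recent ones for zstd and xz as well.

#include <stdint.h>

#pragma pack(push, 1)
typedef struct ZipLocalFileHeader
{
    uint32_t    signature;      /* 0x04034b50, "PK\3\4" */
    uint16_t    version_needed;
    uint16_t    flags;
    uint16_t    method;         /* compression method, chosen per member */
    uint16_t    mod_time;
    uint16_t    mod_date;
    uint32_t    crc32;
    uint32_t    compressed_size;
    uint32_t    uncompressed_size;
    uint16_t    name_len;
    uint16_t    extra_len;
    /* followed by the name, the extra field, and the compressed data */
} ZipLocalFileHeader;
#pragma pack(pop)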


> Either way, I don't really like the idea of having PostgreSQL have its
> own code to generate and interpret various archive formats. That seems
> like a maintenance nightmare and a recipe for bugs. How can anyone
> even verify that our existing 'tar' code works with all 'tar'
> implementations out there, or that it's correct in all cases? Do we
> really want to maintain similar code for other formats, or even for
> this one? I'd say "no". We should pick archive formats that have good,
> well-maintained libraries with permissive licenses and then use those.
> I don't know whether "zip" falls into that category or not.

I agree we should pick one. I think tar is not a great choice. .zip
seems like it'd be a significant improvement - but not necessarily
optimal.


> > > Other options include, perhaps, (1) emitting a tarfile of compressed
> > > files instead of a compressed tarfile
> >
> > Yea, that'd help some. Although I am not sure how good the tooling to
> > seek through tarfiles in an O(files) rather than O(bytes) manner is.
> 
> Well, considering that at present we're using hand-rolled code...

Good point.

Also looks like at least gnu tar supports seeking (when not reading from
a pipe etc).


> > I think there are some cases where using separate compression state for each
> > file would hurt us. Some of the archive formats have support for reusing
> > compression state, but I don't know which.
> 
> Yeah, I had the same thought. People with mostly 1GB relation segments
> might not notice much difference, but people with lots of little
> relations might see a more significant difference.

Yea. I suspect it's close to immeasurable for large relations.  Reusing
the dictionary might help, although it likely would imply some
overhead. OTOH, the overhead of small relations will usually probably be
in the number of files, rather than the actual size.


FWIW, not that it's really relevant to this discussion, but I played
around with using trained compression dictionaries for postgres
contents. Can improve e.g. lz4's compression ratio a fair bit, in
particular when compressing small amounts of data. E.g. per-block
compression or such.
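
For the curious, a minimal sketch of that kind of experiment with
libzstd's dictionary trainer (collecting the sample blocks is elided,
and the function names are made up; lz4 has its own hooks for loading a
dictionary):

#include <zstd.h>
#include <zdict.h>

#define BLCKSZ 8192

/* Train a dictionary on sample blocks. "samples" is nsamples
 * concatenated blocks; sizes[i] is the length of sample i. */
static size_t
train_block_dict(void *dict, size_t dict_cap,
                 const void *samples, const size_t *sizes, unsigned nsamples)
{
    return ZDICT_trainFromBuffer(dict, dict_cap, samples, sizes, nsamples);
}

/* Compress a single block using the trained dictionary. */
static size_t
compress_block(ZSTD_CCtx *cctx, void *dst, size_t dst_cap,
               const void *block, const void *dict, size_t dict_size)
{
    return ZSTD_compress_usingDict(cctx, dst, dst_cap, block, BLCKSZ,
                                   dict, dict_size, 1 /* level */);
}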


> FWIW, I don't see it as being entirely necessary to create a seekable
> compressed archive format, let alone to make all of our compressed
> archive formats seekable. I think supporting multiple compression
> algorithms in a flexible way that's not too tied to the capabilities
> of particular algorithms is more important. If you want fast restores
> of incremental and differential backups, consider using -Fp rather
> than -Ft.

Given how compressible many real-world databases are (maybe not quite
the 50x as in the pgbench -i case, but still extremely so), I don't
quite find -Fp a convincing alternative.


> Or we can have a new option that's like -Fp but every file
> is compressed individually in place, or files larger than N bytes are
> compressed in place using a configurable algorithm. It might be
> somewhat less efficient but it's also way less complicated to
> implement, and I think that should count for something.

Yea, I think that'd be a decent workaround.


> I don't want to get so caught up in advanced features here that we
> don't make any useful progress at all. If we can add better features
> without a large complexity increment, and without drawing objections
> from others on this list, great. If not, I'm prepared to summarily
> jettison it as nice-to-have but not essential.

Just to be clear: I am not at all advocating tying a change of the
archive format to compression method / parallelism changes or anything.


> > I don't really see any of the concerns there applying to the base
> > backup case.
> 
> I felt like there was some reason that threads were bad, but it may
> have just been the case you mentioned and not relevant here.

I mean, they do have some serious issues when postgres infrastructure is
needed. Not being threadsafe and all. One needs to be careful to not let
"threads escape", to not fork() etc. That doesn't seems like a problem
here though.

Greetings,

Andres Freund



Re: design for parallel backup

From
Robert Haas
Date:
On Wed, Apr 22, 2020 at 2:06 PM Andres Freund <andres@anarazel.de> wrote:
> I also can see a case for using N backends and one connection, but I
> think that'll be too complicated / too much bound by locking around the
> socket etc.

Agreed.

> Oh? I find it *extremely* exciting here. This is pretty close to the
> worst case compressability-wise, and zstd takes only ~22% of the time as
> gzip does, while still delivering better compression.  A nearly 5x
> improvement in compression times seems pretty exciting to me.
>
> Or do you mean for zstd over lz4, rather than anything over gzip?  1.8x
> -> 2.3x is a pretty decent improvement still, no? And being able to do
> it in 1/3 of the wall time seems pretty helpful.

I meant the latter thing, not the former. I'm taking it as given that
we don't want gzip as the only option. Yes, 1.8x -> 2.3x is decent,
but not as earth-shattering as 8.8x -> ~24x.

In any case, I lean towards adding both lz4 and zstd as options, so I
guess we're not really disagreeing here.

> > Parallel zstd still compresses somewhat better than single-core lz4,
> > but the difference in compression ratio is far less, and the amount of
> > CPU you have to burn in order to get that extra compression is pretty
> > large.
>
> It's "just" a ~2x difference for "level 1" compression, right? For
> having 1.9GiB less to write / permanently store of a 16GiB base
> backup, that doesn't seem that bad to me.

Sure, sure. I'm just saying some people may not be OK with ramping up
to 10 or more compression threads on their master server, if it's
already heavily loaded, and maybe only has 4 vCPUs or whatever, so we
should have lighter-weight options for those people. I'm not trying to
argue against zstd or against the idea of ramping up large numbers of
compression threads, just saying that lz4 looks awfully nice for
people who need some compression but are tight on CPU cycles.

> I agree we should pick one. I think tar is not a great choice. .zip
> seems like it'd be a significant improvement - but not necessarily
> optimal.

Other ideas?

> > I don't want to get so caught up in advanced features here that we
> > don't make any useful progress at all. If we can add better features
> > without a large complexity increment, and without drawing objections
> > from others on this list, great. If not, I'm prepared to summarily
> > jettison it as nice-to-have but not essential.
>
> Just to be clear: I am not at all advocating tying a change of the
> archive format to compression method / parallelism changes or anything.

Good, thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-04-22 14:40:17 -0400, Robert Haas wrote:
> > Oh? I find it *extremely* exciting here. This is pretty close to the
> > worst case compressability-wise, and zstd takes only ~22% of the time as
> > gzip does, while still delivering better compression.  A nearly 5x
> > improvement in compression times seems pretty exciting to me.
> >
> > Or do you mean for zstd over lz4, rather than anything over gzip?  1.8x
> > -> 2.3x is a pretty decent improvement still, no? And being able to do
> > it in 1/3 of the wall time seems pretty helpful.
> 
> I meant the latter thing, not the former. I'm taking it as given that
> we don't want gzip as the only option. Yes, 1.8x -> 2.3x is decent,
> but not as earth-shattering as 8.8x -> ~24x.

Ah, good.


> In any case, I lean towards adding both lz4 and zstd as options, so I
> guess we're not really disagreeing here.

We're agreeing, indeed ;)


> > I agree we should pick one. I think tar is not a great choice. .zip
> > seems like it'd be a significant improvement - but not necessarily
> > optimal.
> 
> Other ideas?

The 7zip format, perhaps. Does have format level support to address what
we were discussing earlier: "Support for solid compression, where
multiple files of like type are compressed within a single stream, in
order to exploit the combined redundancy inherent in similar files.".

Greetings,

Andres Freund



Re: design for parallel backup

From
Robert Haas
Date:
On Wed, Apr 22, 2020 at 3:03 PM Andres Freund <andres@anarazel.de> wrote:
> The 7zip format, perhaps. Does have format level support to address what
> we were discussing earlier: "Support for solid compression, where
> multiple files of like type are compressed within a single stream, in
> order to exploit the combined redundancy inherent in similar files.".

I think that might not be a great choice. One potential problem is
that according to https://www.7-zip.org/license.txt the license is
partly LGPL, partly three-clause BSD with an advertising clause, and
partly some strange mostly-free thing with reverse-engineering
restrictions. That sounds pretty unappealing to me as a key dependency
for core technology. It also seems like it's mostly a Windows thing.
p7zip, the "port of the command line version of 7-Zip to Linux/Posix",
last released a new version in 2016. I therefore think that there is
room to question how well supported this is all going to be on the
systems where most of us work all day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Robert Haas
Date:
On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> One question I have not really seen answered well:
>
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.

I spent some time today trying to understand just one part of this,
which is how long it will take to write the base backup out to disk
and whether having multiple independent processes helps. I settled on
writing and fsyncing 64GB of data, written in 8kB chunks, divided into
1, 2, 4, 8, or 16 equal size files, with each file written by a
separate process, and an fsync() at the end before process exit. So in
this test, there is no question of whether the master can read the
data fast enough, nor is there any issue of network bandwidth. It's
purely a test of whether it's faster to have one process write a big
file or whether it's faster to have multiple processes each write a
smaller file.
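
Roughly, each worker process in the test does the equivalent of the
following (a sketch, not the attached program; error handling is
omitted and the names are illustrative):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 8192

static void
write_one_file(const char *path, long long total_bytes)
{
    char        buf[CHUNK];
    long long   written;
    int         fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    memset(buf, 'x', sizeof(buf));
    for (written = 0; written < total_bytes; written += CHUNK)
        write(fd, buf, CHUNK);
    fsync(fd);                  /* timed separately, as in the results below */
    close(fd);
}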

I tested this on EDB's cthulhu. It's an older server, but it happens
to have 4 mount points available for testing, one with XFS + magnetic
disks, one with ext4 + magnetic disks, one with XFS + SSD, and one
with ext4 + SSD. I did the experiment described above on each mount
point separately, and then I also tried 4, 8, or 16 equal size files
spread evenly across the 4 mount points. To summarize the results very
briefly:

1. ext4 degraded really badly with >4 concurrent writers. XFS did not.
2. SSDs were faster than magnetic disks, but you had to use XFS and
>=4 concurrent writers to get the benefit.
3. Spreading writes across the mount points works well, but the
slowest mount point sets the pace.

Here are more detailed results, with times in seconds:

filesystem media 1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
xfs mag 97 53 60 67 71
ext4 mag 94 68 66 335 549
xfs ssd 97 55 33 27 25
ext4 ssd 116 70 66 227 450
spread spread n/a n/a 48 42 44

The spread test with 16 files @ 4GB looks like this:

[/mnt/data-ssd/robert.haas/test14] open: 0, write: 7, fsync: 0, close: 0, total: 7
[/mnt/data-ssd/robert.haas/test10] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test2] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test6] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd2/robert.haas/test3] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test11] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test15] open: 0, write: 17, fsync: 0, close: 0, total: 17
[/mnt/data-ssd2/robert.haas/test7] open: 0, write: 18, fsync: 0, close: 0, total: 18
[/mnt/data-mag/robert.haas/test16] open: 0, write: 7, fsync: 18, close: 0, total: 25
[/mnt/data-mag/robert.haas/test4] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test12] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test8] open: 0, write: 7, fsync: 22, close: 0, total: 29
[/mnt/data-mag2/robert.haas/test9] open: 0, write: 20, fsync: 23, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test13] open: 0, write: 18, fsync: 25, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test5] open: 0, write: 19, fsync: 24, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test1] open: 0, write: 18, fsync: 25, close: 0, total: 43

The fastest write performance of any test was the 16-way XFS-SSD test,
which wrote at about 2.56 gigabytes per second. The fastest
single-file test was on ext4-magnetic, though ext4-ssd and
xfs-magnetic were similar, around 0.66 gigabytes per second. Your
system must be a LOT faster, because you were seeing pg_basebackup
running at, IIUC, ~3 gigabytes per second, and that would have been a
second process both writing and doing other things. For comparison,
some recent local pg_basebackup testing on this machine by some of my
colleagues ran at about 0.82 gigabytes per second.

I suspect it would be possible to get significantly higher numbers on
this hardware by (1) changing all the filesystems over to XFS and (2)
dividing the data dynamically based on write speed rather than writing
the same amount of it everywhere. I bet we could reach 6-8 gigabytes
per second if we did all that.

Now, I don't know how much this matters. To get limited by this stuff,
you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit
Ethernet or something like that - or to be doing a local backup. But I
thought that it was interesting and that I should share it, so here
you go! I do wonder if the apparent concurrency problems with ext4
might matter on systems with high connection counts just in normal
operation, backups aside.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-04-30 14:50:34 -0400, Robert Haas wrote:
> On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> > One question I have not really seen answered well:
> >
> > Why do we want parallelism here. Or to be more precise: What do we hope
> > to accelerate by making what part of creating a base backup
> > parallel. There's several potential bottlenecks, and I think it's
> > important to know the design priorities to evaluate a potential design.
>
> I spent some time today trying to understand just one part of this,
> which is how long it will take to write the base backup out to disk
> and whether having multiple independent processes helps. I settled on
> writing and fsyncing 64GB of data, written in 8kB chunks

Why 8kb? That's smaller than what we currently do in pg_basebackup,
afaict, and you're actually going to be bottlenecked by syscall
overhead at that point (unless you disable / don't have the whole intel
security mitigation stuff).


> , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> written by a separate process, and an fsync() at the end before
> process exit. So in this test, there is no question of whether the
> master can read the data fast enough, nor is there any issue of
> network bandwidth. It's purely a test of whether it's faster to have
> one process write a big file or whether it's faster to have multiple
> processes each write a smaller file.

That's not necessarily the only question though, right? There's also the
approach one process writing out multiple files (via buffered, not async
IO)? E.g. one basebackup connecting to multiple backends, or just
shuffling multiple files through one copy stream.


> I tested this on EDB's cthulhu. It's an older server, but it happens
> to have 4 mount points available for testing, one with XFS + magnetic
> disks, one with ext4 + magnetic disks, one with XFS + SSD, and one
> with ext4 + SSD.

IIRC cthulhu's SSDs are not that fast, compared to NVMe storage (by
nearly an order of magnitude IIRC). So this might be disadvantaging the
parallel case more than it should. Also perhaps the ext4 disadvantage is
smaller on more modern kernel versions?

If you can provide me with the test program, I'd happily run it on some
decent, but not upper end, NVMe SSDs.


> The fastest write performance of any test was the 16-way XFS-SSD test,
> which wrote at about 2.56 gigabytes per second. The fastest
> single-file test was on ext4-magnetic, though ext4-ssd and
> xfs-magnetic were similar, around 0.66 gigabytes per second.

I think you might also be seeing some interaction with write caching on
the raid controller here. The file sizes are small enough to fit in
there to a significant degree for the single file tests.


> Your system must be a LOT faster, because you were seeing
> pg_basebackup running at, IIUC, ~3 gigabytes per second, and that
> would have been a second process both writing and doing other
> things.

Right. On my workstation I have an NVMe SSD that can do ~2.5 GiB/s
sustained; in my laptop one that peaks to ~3.2GiB/s but then quickly goes
to ~2GiB/s.

FWIW, I ran a "benchmark" just now just using dd, on my laptop, on
battery (so take this with a huge grain of salt). With 1 dd writing out
150GiB in 8kb blocks I get 1.8GiB/s, and with two writing 75GiB each
~840MiB/s, with three writing 50GiB each 550MiB/s.


> Now, I don't know how much this matters. To get limited by this stuff,
> you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit
> Ethernet or something like that - or to be doing a local backup. But I
> thought that it was interesting and that I should share it, so here
> you go! I do wonder if the apparent concurrency problems with ext4
> might matter on systems with high connection counts just in normal
> operation, backups aside.

I have seen such problems. Some of them have gotten better though. For
most (all?) linux filesystems we can easily run into filesystem
concurrency issues from within postgres. There's basically a file level
exclusive lock for buffered writes (only for the copy into the page
cache though), due to posix requirements about the effects of a write
being atomic.

Greetings,

Andres Freund



Re: design for parallel backup

From
Robert Haas
Date:
On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> Why 8kb? That's smaller than what we currently do in pg_basebackup,
> afaict, and you're actually going to be bottlenecked by syscall
> overhead at that point (unless you disable / don't have the whole intel
> security mitigation stuff).

I just picked something. Could easily try other things.

> > , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> > written by a separate process, and an fsync() at the end before
> > process exit. So in this test, there is no question of whether the
> > master can read the data fast enough, nor is there any issue of
> > network bandwidth. It's purely a test of whether it's faster to have
> > one process write a big file or whether it's faster to have multiple
> > processes each write a smaller file.
>
> That's not necessarily the only question though, right? There's also the
> approach one process writing out multiple files (via buffered, not async
> IO)? E.g. one basebackup connecting to multiple backends, or just
> shuffling multiple files through one copy stream.

Sure, but that seems like it can't scale better than this. You have
the scaling limitations of the filesystem, plus the possibility that
the process is busy doing something else when it could be writing to
any particular file.

> If you can provide me with the test program, I'd happily run it on some
> decent, but not upper end, NVMe SSDs.

It was attached, but I forgot to mention that in the body of the email.

> I think you might also be seeing some interaction with write caching on
> the raid controller here. The file sizes are small enough to fit in
> there to a significant degree for the single file tests.

Yeah, that's possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Robert Haas
Date:
On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > Why 8kb? That's smaller than what we currently do in pg_basebackup,
> > afaict, and you're actually going to be bottlenecked by syscall
> > overhead at that point (unless you disable / don't have the whole intel
> > security mitigation stuff).
>
> I just picked something. Could easily try other things.

I tried changing the write size to 64kB, keeping the rest the same.
Here are the results:

filesystem media 1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
xfs mag 65 53 64 74 79
ext4 mag 96 68 75 303 437
xfs ssd 75 43 29 33 38
ext4 ssd 96 68 63 214 254
spread spread n/a n/a 43 38 40

And here again are the previous results with an 8kB write size:

xfs mag 97 53 60 67 71
ext4 mag 94 68 66 335 549
xfs ssd 97 55 33 27 25
ext4 ssd 116 70 66 227 450
spread spread n/a n/a 48 42 44

Generally, those numbers look better than the previous numbers, but
parallelism still looks fairly appealing on the SSD storage - less so
on magnetic disks, at least in this test.

Hmm, now I wonder what write size pg_basebackup is actually using.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-05-01 16:32:15 -0400, Robert Haas wrote:
> On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Why 8kb? That's smaller than what we currently do in pg_basebackup,
> > > afaict, and you're actually going to be bottlenecked by syscall
> > > overhead at that point (unless you disable / don't have the whole intel
> > > security mitigation stuff).
> >
> > I just picked something. Could easily try other things.
> 
> I tried changing the write size to 64kB, keeping the rest the same.
> Here are the results:
> 
> filesystem media 1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
> xfs mag 65 53 64 74 79
> ext4 mag 96 68 75 303 437
> xfs ssd 75 43 29 33 38
> ext4 ssd 96 68 63 214 254
> spread spread n/a n/a 43 38 40
> 
> And here again are the previous results with an 8kB write size:
> 
> xfs mag 97 53 60 67 71
> ext4 mag 94 68 66 335 549
> xfs ssd 97 55 33 27 25
> ext4 ssd 116 70 66 227 450
> spread spread n/a n/a 48 42 44
> 
> Generally, those numbers look better than the previous numbers, but
> parallelism still looks fairly appealing on the SSD storage - less so
> on magnetic disks, at least in this test.

I spent a fair bit of time analyzing this, and my conclusion is that you
might largely be seeing numa effects. Yay.

I don't have as large a numa machine at hand, but here's what I'm
seeing on my local machine, during a run of writing out 400GiB (this is
a run with noise on the machine; the benchmarks below are without
that). The machine has 192GiB of ram, evenly distributed to two sockets
/ numa domains.


At start I see
numastat -m|grep -E 'MemFree|MemUsed|Dirty|Writeback|Active\(file\)|Inactive\(file\)'
MemFree                 91908.20        92209.85       184118.05
MemUsed                  3463.05         4553.33         8016.38
Active(file)              105.46          328.52          433.98
Inactive(file)             68.29          190.14          258.43
Dirty                       0.86            0.90            1.76
Writeback                   0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00

For a while there's pretty decent IO throughput (all 10s samples):
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz    d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1955.67   2299.32     0.00   0.00   42.48  1203.94    0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   82.10  89.33

Then it starts to be slower on a sustained basis:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz    d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1593.33   1987.85     0.00   0.00   42.90  1277.55    0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   67.55  76.53

And then performance tanks completely:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz    d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00  646.33    781.85     0.00   0.00  132.68  1238.70    0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   85.43  58.63


That amount of degradation confused me for a while, especially because I
couldn't reproduce it once I made the setup more controlled. In
particular, I stopped seeing issues of the same magnitude after pinning
processes to one NUMA socket (both execution and memory).

After a few seconds:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz    d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1882.00   2320.07     0.00   0.00   42.50  1262.35    0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   79.05  88.07

MemFree                 35356.50        80986.46       116342.96
MemUsed                 60014.75        15776.72        75791.47
Active(file)              179.44          163.28          342.72
Inactive(file)          58293.18        13385.15        71678.33
Dirty                   18407.50          882.00        19289.50
Writeback                 235.78          335.43          571.21
WritebackTmp                0.00            0.00            0.00

A bit later IO starts to get slower:

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz    d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1556.30   1898.70     0.00   0.00   40.92  1249.29    0.00     0.00     0.00   0.00    0.00     0.00    0.20   24.00   62.90  72.01

MemFree                   519.56        36086.14        36605.70
MemUsed                 94851.69        60677.04       155528.73
Active(file)              303.84          212.96          516.80
Inactive(file)          92776.70        58133.28       150909.97
Dirty                   10913.20         5374.07        16287.27
Writeback                 812.94          331.96         1144.90
WritebackTmp                0.00            0.00            0.00


And then later it gets worse:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz    d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00 1384.70   1671.25     0.00   0.00   40.87  1235.91    0.00     0.00     0.00   0.00    0.00     0.00    0.20    7.00   55.89  63.45

MemFree                   519.54          242.98          762.52
MemUsed                 94851.71        96520.20       191371.91
Active(file)              175.82          246.03          421.85
Inactive(file)          92820.19        93985.79       186805.98
Dirty                   10482.75         4140.72        14623.47
Writeback                   0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00

When using a 1s iostat interval instead of 10s, it's noticeable that
performance swings widely between very slow (<100MB/s) and very high
throughput (>2500MB/s).

It's clearly visible that performance degrades substantially first when
one NUMA node's free memory is exhausted, and then again when the
second node's is.

Looking at a profile I see a lot of cacheline bouncing between the
kernel threads that "reclaim" pages (i.e. make them available for
reuse), the kernel threads that write out dirty pages, the kernel
threads where the IO completes (i.e. where the dirty bit can be flipped
/ locks get released), and the writing process.

I think there's a lot the kernel side could improve - but it's not too
surprising that letting the kernel cache / forcing it to make caching
decisions for a large streaming write has substantial costs.


I changed Robert's test program to optionally fallocate,
sync_file_range(WRITE), and posix_fadvise(DONTNEED), to avoid a large
footprint in the page cache. The performance differences are quite
substantial:

gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
    rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
    /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
[/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220

comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
[/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
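
To make the shape of the modified loop concrete, it is roughly the
following (an illustrative sketch, not the actual write_and_fsync.c;
the flush-window size and error handling are simplified):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSZ  (8 * 1024)
#define WINDOWSZ (32 * 1024 * 1024)     /* writeback window, arbitrary */

static void
write_with_cache_control(int fd, off_t filesize)
{
    char   *buf = malloc(BLOCKSZ);
    off_t   written = 0;
    off_t   flushed = 0;

    memset(buf, 'x', BLOCKSZ);

    /* reserve the file's space up front instead of relying on the
     * filesystem's delayed allocation */
    fallocate(fd, 0, 0, filesize);

    while (written < filesize)
    {
        if (write(fd, buf, BLOCKSZ) != BLOCKSZ)
            abort();
        written += BLOCKSZ;

        if (written - flushed >= WINDOWSZ)
        {
            /* kick off writeback for the range we just wrote ... */
            sync_file_range(fd, flushed, written - flushed,
                            SYNC_FILE_RANGE_WRITE);
            /* ... and drop ranges whose writeback should have completed
             * by now from the page cache */
            if (flushed >= WINDOWSZ)
                posix_fadvise(fd, 0, flushed - WINDOWSZ,
                              POSIX_FADV_DONTNEED);
            flushed = written;
        }
    }

    fsync(fd);
    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
    free(buf);
}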

Below are the results of running the program with a variation of
parameters (both the program and the full results are attached).

I used perf stat in this run to measure the difference in CPU
usage.

ref_cycles is the number of CPU cycles, across all 20 cores / 40
threads, during which the CPUs were doing *something*. It is not
affected by CPU frequency scaling, just by the time the CPUs were not
"halted", whereas cycles is affected by frequency scaling.

A high ref_cycles_sec, combined with a decent number of total
instructions/cycles, is *good*, because it indicates that fewer CPUs
were used. Whereas a very high ref_cycles_tot means that more CPUs were
busy doing something for the duration of the benchmark.
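
(As a cross-check on how perf normalizes these, assuming I read it
correctly: for the first row below, 1,497,048,950,014 ref-cycles over
248.43 seconds of wall time on 40 hardware threads works out to
1,497,048,950,014 / (248.43 * 40) ~= 150.65M/sec, which matches the
reported ref_cycles_sec.)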

The run-to-run variations between the runs without cache control are
pretty large. So this is probably not the end-all-be-all numbers. But I
think the trends are pretty clear.

test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc

numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   248.430736196   1,497,048,950,014   150.653M/sec   1,226,822,167,960   0.123GHz   705,950,461,166   0.54
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   310.275952938   1,921,817,571,226   154.849M/sec   1,499,581,687,133   0.121GHz   944,243,167,053   0.59
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   164.175492485   913,991,290,231   139.183M/sec   762,359,320,428   0.116GHz   678,451,556,273   0.84
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0   243.609959554   1,802,385,405,203   184.970M/sec   1,449,560,513,247   0.149GHz   855,426,288,031   0.56
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0   230.880100449   1,328,417,418,799   143.846M/sec   1,148,924,667,393   0.124GHz   723,158,246,628   0.63
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1   253.591234992   1,548,485,571,798   152.658M/sec   1,229,926,994,613   0.121GHz   1,117,352,436,324   0.95
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   164.488835158   911,974,902,254   138.611M/sec   760,756,011,483   0.116GHz   672,105,046,261   0.84

numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   192.151682414   1,526,440,715,456   198.603M/sec   1,037,135,756,007   0.135GHz   802,754,964,096   0.76
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   242.648245159   1,782,637,416,163   183.629M/sec   1,463,696,313,881   0.151GHz   1,000,100,694,932   0.69
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   188.772193248   1,418,274,870,697   187.803M/sec   923,133,958,500   0.122GHz   799,212,291,243   0.92
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0   421.580487642   2,756,486,952,728   163.449M/sec   1,387,708,033,752   0.082GHz   990,478,650,874   0.72
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0   169.854206542   1,333,619,626,854   196.282M/sec   1,036,261,531,134   0.153GHz   666,052,333,591   0.64
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1   305.078100578   1,970,042,289,192   161.445M/sec   1,505,706,462,812   0.123GHz   954,963,240,648   0.62
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   166.295223626   1,290,699,256,763   194.044M/sec   857,873,391,283   0.129GHz   761,338,026,415   0.89

numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   256.156100686   2,407,922,637,215   235.003M/sec   1,133,311,037,956   0.111GHz   748,666,206,805   0.65
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   215.255015340   1,977,578,120,924   229.676M/sec   1,461,504,758,029   0.170GHz   1,005,270,838,642   0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   158.262790654   1,720,443,307,097   271.769M/sec   1,004,079,045,479   0.159GHz   826,905,592,751   0.84
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0   334.932246893   2,366,388,662,460   176.628M/sec   1,216,049,589,993   0.091GHz   796,698,831,717   0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0   161.697270285   1,866,036,713,483   288.576M/sec   1,068,181,502,433   0.165GHz   739,559,279,008   0.70
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1   231.440889430   1,965,389,749,057   212.391M/sec   1,407,927,406,358   0.152GHz   997,199,361,968   0.72
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   214.433248700   2,232,198,239,769   260.300M/sec   1,073,334,918,389   0.125GHz   861,540,079,120   0.80

numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0   644.521613661   3,688,449,404,537   143.079M/sec   2,020,128,131,309   0.078GHz   961,486,630,359   0.48
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0   243.830464632   1,499,608,983,445   153.756M/sec   1,227,468,439,403   0.126GHz   691,534,661,654   0.59
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1   292.866419420   1,753,376,415,877   149.677M/sec   1,483,169,463,392   0.127GHz   860,035,914,148   0.56
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1   162.152397194   925,643,754,128   142.719M/sec   743,208,501,601   0.115GHz   554,462,585,110   0.70
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0   211.369510165   1,558,996,898,599   184.401M/sec   1,359,343,408,200   0.161GHz   766,769,036,524   0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0   233.315094908   1,427,133,080,540   152.927M/sec   1,166,000,868,597   0.125GHz   743,027,329,074   0.64
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1   290.698155820   1,732,849,079,701   149.032M/sec   1,441,508,612,326   0.124GHz   835,039,426,282   0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1   159.945462440   850,162,390,626   132.892M/sec   724,286,281,548   0.113GHz   670,069,573,150   0.90

numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0   163.244592275   1,524,807,507,173   233.531M/sec   1,398,319,581,978   0.214GHz   689,514,058,243   0.46
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0   231.795934322   1,731,030,267,153   186.686M/sec   1,124,935,745,020   0.121GHz   736,084,922,669   0.70
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1   315.564163702   1,958,199,733,216   155.128M/sec   1,405,115,546,716   0.111GHz   1,000,595,890,394   0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1   210.945487961   1,527,169,148,899   180.990M/sec   906,023,518,692   0.107GHz   700,166,552,207   0.80
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0   161.759094088   1,468,321,054,671   226.934M/sec   1,221,167,105,510   0.189GHz   735,855,415,612   0.59
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0   158.578248952   1,354,770,825,277   213.586M/sec   936,436,363,752   0.148GHz   654,823,079,884   0.68
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1   274.628500801   1,792,841,068,080   163.209M/sec   1,343,398,055,199   0.122GHz   996,073,874,051   0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1   179.140070123   1,383,595,004,328   193.095M/sec   850,299,722,091   0.119GHz   706,959,617,654   0.83

numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0   445.496787199   2,663,914,572,687   149.495M/sec   1,267,340,496,930   0.071GHz   787,469,552,454   0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0   261.866083604   2,325,884,820,091   222.043M/sec   1,094,814,208,219   0.105GHz   649,479,233,453   0.57
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1   172.963505544   1,717,387,683,260   248.228M/sec   1,356,381,335,831   0.196GHz   822,256,638,370   0.58
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1   157.934678897   1,650,503,807,778   261.266M/sec   970,705,561,971   0.154GHz   637,953,927,131   0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0   225.623143601   1,804,402,820,599   199.938M/sec   1,086,394,788,362   0.120GHz   656,392,112,807   0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0   157.930900998   1,797,506,082,342   284.548M/sec   1,001,509,813,741   0.159GHz   644,107,150,289   0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1   165.772265335   1,805,895,001,689   272.353M/sec   1,514,173,918,970   0.228GHz   823,435,044,810   0.54
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1   187.664764448   1,964,118,348,429   261.660M/sec   978,060,510,880   0.130GHz   668,316,194,988   0.67


Greetings,

Andres Freund

Attachments

Re: design for parallel backup

From
Robert Haas
Date:
On Sat, May 2, 2020 at 10:36 PM Andres Freund <andres@anarazel.de> wrote:
> I changed Robert's test program to optionally fallocate,
> sync_file_range(WRITE), and posix_fadvise(DONTNEED), to avoid a large
> footprint in the page cache. The performance
> differences are quite substantial:
>
> gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
>     rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
>     /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
>
> running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
>
> comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161

Ah, nice.

> The run-to-run variations between the runs without cache control are
> pretty large. So this is probably not the end-all-be-all numbers. But I
> think the trends are pretty clear.

Could you be explicit about what you think those clear trends are?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-05-03 09:12:59 -0400, Robert Haas wrote:
> On Sat, May 2, 2020 at 10:36 PM Andres Freund <andres@anarazel.de> wrote:
> > I changed Robert's test program to optionally fallocate,
> > sync_file_range(WRITE), and posix_fadvise(DONTNEED), to avoid a large
> > footprint in the page cache. The performance
> > differences are quite substantial:
> >
> > gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> >     rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
> >     /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> > [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
> >
> > comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> > [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
>
> Ah, nice.

Btw, I forgot to include the result for 0 / 0 / 0 in the results
(off-by-one error in a script :))

numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68

> > The run-to-run variations between the runs without cache control are
> > pretty large. So this is probably not the end-all-be-all numbers. But I
> > think the trends are pretty clear.
>
> Could you be explicit about what you think those clear trends are?

Largely that concurrency can help a bit, but also hurt
tremendously. Below is some more detailed analysis, it'll be a bit
long...

Taking no concurrency / no cache management as the baseline:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68

and comparing cache management with using some concurrency:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   164.175492485   913,991,290,231   139.183M/sec   762,359,320,428   0.116GHz   678,451,556,273   0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51

we can see very similar timing. Which makes sense, because that's
roughly the device's max speed. But then going to higher concurrency,
there are clearly regressions:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66

And I think it is instructive to look at the
ref_cycles_tot/cycles_tot/instructions_tot vs
ref_cycles_sec/cycles_sec/ipc. The units are confusing because they are
across all cores and most are idle. But it's pretty obvious that
numprocs=1 sfr=1 fadvise=1 has cores running for a lot shorter time
(reference cycles basically count the time cores were running on an
absolute time scale). Compared to numprocs=2 sfr=0 fadvise=0, which has
the same resulting performance, it's clear that cores were busier, but
less efficient (lower ipc).

With cache management there's very little benefit, and some risk (1->2
regression), in this workload with increasing concurrency:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   164.175492485   913,991,290,231   139.183M/sec   762,359,320,428   0.116GHz   678,451,556,273   0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   188.772193248   1,418,274,870,697   187.803M/sec   923,133,958,500   0.122GHz   799,212,291,243   0.92
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   158.262790654   1,720,443,307,097   271.769M/sec   1,004,079,045,479   0.159GHz   826,905,592,751   0.84


And there's good benefit, but tremendous risk, of concurrency in the no
cache control case:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66


sync_file_range without fadvise isn't a benefit at low concurrency, but prevents bad regressions at high concurrency:
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   248.430736196   1,497,048,950,014   150.653M/sec   1,226,822,167,960   0.123GHz   705,950,461,166   0.54

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   192.151682414   1,526,440,715,456   198.603M/sec   1,037,135,756,007   0.135GHz   802,754,964,096   0.76

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   256.156100686   2,407,922,637,215   235.003M/sec   1,133,311,037,956   0.111GHz   748,666,206,805   0.65

fadvise alone is similar:
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   310.275952938   1,921,817,571,226   154.849M/sec   1,499,581,687,133   0.121GHz   944,243,167,053   0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   242.648245159   1,782,637,416,163   183.629M/sec   1,463,696,313,881   0.151GHz   1,000,100,694,932   0.69

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   215.255015340   1,977,578,120,924   229.676M/sec   1,461,504,758,029   0.170GHz   1,005,270,838,642   0.68


There does not appear to be a huge benefit from fallocate in this
workload; the OS's delayed allocation works well. Compare:

numprocs=1
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0   243.609959554   1,802,385,405,203   184.970M/sec   1,449,560,513,247   0.149GHz   855,426,288,031   0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   248.430736196   1,497,048,950,014   150.653M/sec   1,226,822,167,960   0.123GHz   705,950,461,166   0.54
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0   230.880100449   1,328,417,418,799   143.846M/sec   1,148,924,667,393   0.124GHz   723,158,246,628   0.63

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   310.275952938   1,921,817,571,226   154.849M/sec   1,499,581,687,133   0.121GHz   944,243,167,053   0.59
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1   253.591234992   1,548,485,571,798   152.658M/sec   1,229,926,994,613   0.121GHz   1,117,352,436,324   0.95

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   164.175492485   913,991,290,231   139.183M/sec   762,359,320,428   0.116GHz   678,451,556,273   0.84
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   164.488835158   911,974,902,254   138.611M/sec   760,756,011,483   0.116GHz   672,105,046,261   0.84

numprocs=2
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0   421.580487642   2,756,486,952,728   163.449M/sec   1,387,708,033,752   0.082GHz   990,478,650,874   0.72

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   192.151682414   1,526,440,715,456   198.603M/sec   1,037,135,756,007   0.135GHz   802,754,964,096   0.76
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0   169.854206542   1,333,619,626,854   196.282M/sec   1,036,261,531,134   0.153GHz   666,052,333,591   0.64

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   242.648245159   1,782,637,416,163   183.629M/sec   1,463,696,313,881   0.151GHz   1,000,100,694,932   0.69
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1   305.078100578   1,970,042,289,192   161.445M/sec   1,505,706,462,812   0.123GHz   954,963,240,648   0.62

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   188.772193248   1,418,274,870,697   187.803M/sec   923,133,958,500   0.122GHz   799,212,291,243   0.92
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   166.295223626   1,290,699,256,763   194.044M/sec   857,873,391,283   0.129GHz   761,338,026,415   0.89

numprocs=4
> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0   334.932246893   2,366,388,662,460   176.628M/sec   1,216,049,589,993   0.091GHz   796,698,831,717   0.68

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   256.156100686   2,407,922,637,215   235.003M/sec   1,133,311,037,956   0.111GHz   748,666,206,805   0.65
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0   161.697270285   1,866,036,713,483   288.576M/sec   1,068,181,502,433   0.165GHz   739,559,279,008   0.70

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   215.255015340   1,977,578,120,924   229.676M/sec   1,461,504,758,029   0.170GHz   1,005,270,838,642   0.68
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1   231.440889430   1,965,389,749,057   212.391M/sec   1,407,927,406,358   0.152GHz   997,199,361,968   0.72

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   158.262790654   1,720,443,307,097   271.769M/sec   1,004,079,045,479   0.159GHz   826,905,592,751   0.84
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   214.433248700   2,232,198,239,769   260.300M/sec   1,073,334,918,389   0.125GHz   861,540,079,120   0.80

I would say that it seems to help concurrent cases without cache
control, but not particularly reliably so. At higher concurrency it
seems to hurt with cache control; I'm not sure I understand why.


I was at first confused why 128kB write sizes hurt (128kB is probably
on the higher end of useful, but I wanted to see a more extreme
difference):

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   220.210155081   1,569,524,602,961   178.188M/sec   1,363,686,761,705   0.155GHz   833,345,334,408   0.68
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0   644.521613661   3,688,449,404,537   143.079M/sec   2,020,128,131,309   0.078GHz   961,486,630,359   0.48

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   248.430736196   1,497,048,950,014   150.653M/sec   1,226,822,167,960   0.123GHz   705,950,461,166   0.54
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0   243.830464632   1,499,608,983,445   153.756M/sec   1,227,468,439,403   0.126GHz   691,534,661,654   0.59

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   310.275952938   1,921,817,571,226   154.849M/sec   1,499,581,687,133   0.121GHz   944,243,167,053   0.59
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1   292.866419420   1,753,376,415,877   149.677M/sec   1,483,169,463,392   0.127GHz   860,035,914,148   0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   164.175492485   913,991,290,231   139.183M/sec   762,359,320,428   0.116GHz   678,451,556,273   0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1   162.152397194   925,643,754,128   142.719M/sec   743,208,501,601   0.115GHz   554,462,585,110   0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0   243.609959554   1,802,385,405,203   184.970M/sec   1,449,560,513,247   0.149GHz   855,426,288,031   0.56
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0   211.369510165   1,558,996,898,599   184.401M/sec   1,359,343,408,200   0.161GHz   766,769,036,524   0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0   230.880100449   1,328,417,418,799   143.846M/sec   1,148,924,667,393   0.124GHz   723,158,246,628   0.63
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0   233.315094908   1,427,133,080,540   152.927M/sec   1,166,000,868,597   0.125GHz   743,027,329,074   0.64

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1   253.591234992   1,548,485,571,798   152.658M/sec   1,229,926,994,613   0.121GHz   1,117,352,436,324   0.95
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1   290.698155820   1,732,849,079,701   149.032M/sec   1,441,508,612,326   0.124GHz   835,039,426,282   0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   164.488835158   911,974,902,254   138.611M/sec   760,756,011,483   0.116GHz   672,105,046,261   0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1   159.945462440   850,162,390,626   132.892M/sec   724,286,281,548   0.113GHz   670,069,573,150   0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   164.052510134   1,561,521,537,336   237.972M/sec   1,404,761,167,120   0.214GHz   715,274,337,015   0.51
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0   163.244592275   1,524,807,507,173   233.531M/sec   1,398,319,581,978   0.214GHz   689,514,058,243   0.46

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   192.151682414   1,526,440,715,456   198.603M/sec   1,037,135,756,007   0.135GHz   802,754,964,096   0.76
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0   231.795934322   1,731,030,267,153   186.686M/sec   1,124,935,745,020   0.121GHz   736,084,922,669   0.70

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   242.648245159   1,782,637,416,163   183.629M/sec   1,463,696,313,881   0.151GHz   1,000,100,694,932   0.69
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1   315.564163702   1,958,199,733,216   155.128M/sec   1,405,115,546,716   0.111GHz   1,000,595,890,394   0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   188.772193248   1,418,274,870,697   187.803M/sec   923,133,958,500   0.122GHz   799,212,291,243   0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1   210.945487961   1,527,169,148,899   180.990M/sec   906,023,518,692   0.107GHz   700,166,552,207   0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0   421.580487642   2,756,486,952,728   163.449M/sec   1,387,708,033,752   0.082GHz   990,478,650,874   0.72
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0   161.759094088   1,468,321,054,671   226.934M/sec   1,221,167,105,510   0.189GHz   735,855,415,612   0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0   169.854206542   1,333,619,626,854   196.282M/sec   1,036,261,531,134   0.153GHz   666,052,333,591   0.64
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0   158.578248952   1,354,770,825,277   213.586M/sec   936,436,363,752   0.148GHz   654,823,079,884   0.68

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1   305.078100578   1,970,042,289,192   161.445M/sec   1,505,706,462,812   0.123GHz   954,963,240,648   0.62
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1   274.628500801   1,792,841,068,080   163.209M/sec   1,343,398,055,199   0.122GHz   996,073,874,051   0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   166.295223626   1,290,699,256,763   194.044M/sec   857,873,391,283   0.129GHz   761,338,026,415   0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1   179.140070123   1,383,595,004,328   193.095M/sec   850,299,722,091   0.119GHz   706,959,617,654   0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   455.096916715   2,808,715,616,077   154.293M/sec   1,366,660,063,053   0.075GHz   888,512,073,477   0.66
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0   445.496787199   2,663,914,572,687   149.495M/sec   1,267,340,496,930   0.071GHz   787,469,552,454   0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   256.156100686   2,407,922,637,215   235.003M/sec   1,133,311,037,956   0.111GHz   748,666,206,805   0.65
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0   261.866083604   2,325,884,820,091   222.043M/sec   1,094,814,208,219   0.105GHz   649,479,233,453   0.57

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   215.255015340   1,977,578,120,924   229.676M/sec   1,461,504,758,029   0.170GHz   1,005,270,838,642   0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1   172.963505544   1,717,387,683,260   248.228M/sec   1,356,381,335,831   0.196GHz   822,256,638,370   0.58

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   158.262790654   1,720,443,307,097   271.769M/sec   1,004,079,045,479   0.159GHz   826,905,592,751   0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1   157.934678897   1,650,503,807,778   261.266M/sec   970,705,561,971   0.154GHz   637,953,927,131   0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0   334.932246893   2,366,388,662,460   176.628M/sec   1,216,049,589,993   0.091GHz   796,698,831,717   0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0   225.623143601   1,804,402,820,599   199.938M/sec   1,086,394,788,362   0.120GHz   656,392,112,807   0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0   161.697270285   1,866,036,713,483   288.576M/sec   1,068,181,502,433   0.165GHz   739,559,279,008   0.70
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0   157.930900998   1,797,506,082,342   284.548M/sec   1,001,509,813,741   0.159GHz   644,107,150,289   0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1   231.440889430   1,965,389,749,057   212.391M/sec   1,407,927,406,358   0.152GHz   997,199,361,968   0.72
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1   165.772265335   1,805,895,001,689   272.353M/sec   1,514,173,918,970   0.228GHz   823,435,044,810   0.54

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   214.433248700   2,232,198,239,769   260.300M/sec   1,073,334,918,389   0.125GHz   861,540,079,120   0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1   187.664764448   1,964,118,348,429   261.660M/sec   978,060,510,880   0.130GHz   668,316,194,988   0.67

It's pretty clear that the larger write block size can hurt quite
badly. I was somewhat confused by this at first, but after thinking
about it for a while longer it actually makes sense: for the OS to
finish an 8kB write it needs to find two free pagecache pages; for a
128kB write it needs to find 32. Which means that it's much more likely
that kernel threads and the writes are going to fight over locks /
cachelines: in the 8kB case it's quite likely that the kernel threads
will often do their work while the memcpy() from userland is happening,
but that's less the case when 32 pages need to be acquired before the
memcpy() can happen.

With cache control that problem doesn't exist, which is why the larger
block size is beneficial:

> test                                                                            time            ref_cycles_tot      ref_cycles_sec   cycles_tot          cycles_sec   instructions_tot    ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   164.175492485   913,991,290,231   139.183M/sec   762,359,320,428   0.116GHz   678,451,556,273   0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1   162.152397194   925,643,754,128   142.719M/sec   743,208,501,601   0.115GHz   554,462,585,110   0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   164.488835158   911,974,902,254   138.611M/sec   760,756,011,483   0.116GHz   672,105,046,261   0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1   159.945462440   850,162,390,626   132.892M/sec   724,286,281,548   0.113GHz   670,069,573,150   0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   188.772193248   1,418,274,870,697   187.803M/sec   923,133,958,500   0.122GHz   799,212,291,243   0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1   210.945487961   1,527,169,148,899   180.990M/sec   906,023,518,692   0.107GHz   700,166,552,207   0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   166.295223626   1,290,699,256,763   194.044M/sec   857,873,391,283   0.129GHz   761,338,026,415   0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1   179.140070123   1,383,595,004,328   193.095M/sec   850,299,722,091   0.119GHz   706,959,617,654   0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   158.262790654   1,720,443,307,097   271.769M/sec   1,004,079,045,479   0.159GHz   826,905,592,751   0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1   157.934678897   1,650,503,807,778   261.266M/sec   970,705,561,971   0.154GHz   637,953,927,131   0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   214.433248700   2,232,198,239,769   260.300M/sec   1,073,334,918,389   0.125GHz   861,540,079,120   0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1   187.664764448   1,964,118,348,429   261.660M/sec   978,060,510,880   0.130GHz   668,316,194,988   0.67

Note how, especially in the first few cases, the total number of
instructions required is improved (although due to the way I ran
perf stat the sampling error is pretty large).


I haven't run that test yet, but after looking at all this I would bet
that reducing the block size to 4kB (i.e. a single OS/hardware page)
would help the no cache control case significantly, in particular in
the concurrent case.

And conversely, I'd expect that CPU efficiency will be improved by a
larger block size in the cache control case, for just about any
realistic block size.


I'd love to have faster storage available (faster NVMes, or multiple
ones I can use for benchmarking) to see where the cutoff point for
actually benefiting from concurrency is.


Also worth noting that even the "best case" from a CPU usage point of
view here absolutely *pales* against using direct IO. It's not an
apples-to-apples comparison, but compare buffered IO using
write_and_fsync with unbuffered IO using fio:

128KiB blocksize:

write_and_fsync:
echo 3 | sudo tee /proc/sys/vm/drop_caches && \
    /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
    /tmp/write_and_fsync --blocksize $((128*1024)) --sync_file_range=1 --fallocate=1 --fadvise=1 --sequential=0 \
    --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

 Performance counter stats for 'system wide':

      6,377,903.65 msec cpu-clock                 #   39.999 CPUs utilized
   628,014,590,200      ref-cycles                #   98.467 M/sec
   634,468,623,514      cycles                    #    0.099 GHz
   795,771,756,320      instructions              #    1.25  insn per cycle

     159.451492209 seconds time elapsed

fio:
rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
    /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
    fio --name=test --iodepth=512 --iodepth_low=8 --iodepth_batch_submit=8 --iodepth_batch_complete_min=8 \
    --iodepth_batch_complete_max=128 --ioengine=libaio --rw=write --bs=128k --filesize=$((400*1024*1024*1024)) --direct=1 --numjobs=1

 Performance counter stats for 'system wide':

      6,313,522.71 msec cpu-clock                 #   39.999 CPUs utilized
   458,476,185,800      ref-cycles                #   72.618 M/sec
   196,148,015,054      cycles                    #    0.031 GHz
   158,921,457,853      instructions              #    0.81  insn per cycle

     157.842080440 seconds time elapsed

CPU usage most of the time was around 98% for write_and_fsync and 40%
for fio.

I.e. system-wide CPUs were active 0.73x the time, and 0.2x as many
instructions had to be executed in the DIO case.

Greetings,

Andres Freund



Re: design for parallel backup

From
Robert Haas
Date:
On Sun, May 3, 2020 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > > The run-to-run variations between the runs without cache control are
> > > pretty large. So this is probably not the end-all-be-all numbers. But I
> > > think the trends are pretty clear.
> >
> > Could you be explicit about what you think those clear trends are?
>
> Largely that concurrency can help a bit, but also hurt
> tremendously. Below is some more detailed analysis, it'll be a bit
> long...

OK, thanks. Let me see if I can summarize here. On the strength of
previous experience, you'll probably tell me that some parts of this
summary are wildly wrong or at least "not quite correct" but I'm going
to try my best.

- Server-side compression seems like it has the potential to be a
significant win by stretching bandwidth. We likely need to do it with
10+ parallel threads, at least for stronger compressors, but these
might be threads within a single PostgreSQL process rather than
multiple separate backends.

- Client-side cache management -- that is, use of
posix_fadvise(DONTNEED), posix_fallocate, and sync_file_range, where
available -- looks like it can improve write rates and CPU efficiency
significantly. Larger block sizes show a win when used together with
such techniques.

- The benefits of multiple concurrent connections remain somewhat
elusive. Peter Eisentraut hypothesized upthread that such an approach
might be the most practical way forward for networks with a high
bandwidth-delay product, and I hypothesized that such an approach
might be beneficial when there are multiple tablespaces on independent
disks, but we don't have clear experimental support for those
propositions. Also, both your data and mine indicate that too much
parallelism can lead to major regressions.

- Any work we do while trying to make backup super-fast should also
lend itself to super-fast restore, possibly including parallel
restore. Compressed tarfiles don't permit random access to member
files. Uncompressed tarfiles do, but software that works this way is
not commonplace. The only mainstream archive format that seems to
support random access is zip. Adopting that wouldn't be
crazy, but might limit our choice of compression options more than
we'd like. A tar file of individually compressed files might be a
plausible alternative, though there would probably be some hit to
compression ratios for small files. Then again, if a single,
highly-efficient process can handle a server-to-client backup, maybe
the same is true for extracting a compressed tarfile...

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: design for parallel backup

From
Andres Freund
Date:
Hi,

On 2020-05-04 14:04:32 -0400, Robert Haas wrote:
> OK, thanks. Let me see if I can summarize here. On the strength of
> previous experience, you'll probably tell me that some parts of this
> summary are wildly wrong or at least "not quite correct" but I'm going
> to try my best.

> - Server-side compression seems like it has the potential to be a
> significant win by stretching bandwidth. We likely need to do it with
> 10+ parallel threads, at least for stronger compressors, but these
> might be threads within a single PostgreSQL process rather than
> multiple separate backends.

That seems right. I think it might be reasonable to just support
"compression parallelism" for zstd, as the library has all the code
internally. So we basically wouldn't have to care about it.
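
For illustration, the core of it would be roughly the following sketch
(not proposed code; the level and buffer sizes are arbitrary, and it
needs a libzstd built with multithreading support):

#include <stdio.h>
#include <zstd.h>

static void
compress_stream(FILE *in, FILE *out, int nworkers)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    char        inbuf[128 * 1024];
    char        outbuf[128 * 1024];
    size_t      remaining;

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
    /* >0 makes ZSTD_compressStream2() compress in background threads */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);

    for (;;)
    {
        size_t            nread = fread(inbuf, 1, sizeof(inbuf), in);
        ZSTD_EndDirective mode = nread < sizeof(inbuf) ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer     input = {inbuf, nread, 0};

        do
        {
            ZSTD_outBuffer output = {outbuf, sizeof(outbuf), 0};

            /* returns how much input/internal state is still pending */
            remaining = ZSTD_compressStream2(cctx, &output, &input, mode);
            fwrite(outbuf, 1, output.pos, out);
        } while (mode == ZSTD_e_end ? remaining != 0 : input.pos < input.size);

        if (mode == ZSTD_e_end)
            break;
    }
    ZSTD_freeCCtx(cctx);
}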


> - Client-side cache management -- that is, use of
> posix_fadvise(DONTNEED), posix_fallocate, and sync_file_range, where
> available -- looks like it can improve write rates and CPU efficiency
> significantly. Larger block sizes show a win when used together with
> such techniques.

Yea. Alternatively direct io, but I am not sure we want to go there for
now.


> - The benefits of multiple concurrent connections remain somewhat
> elusive. Peter Eisentraut hypothesized upthread that such an approach
> might be the most practical way forward for networks with a high
> bandwidth-delay product, and I hypothesized that such an approach
> might be beneficial when there are multiple tablespaces on independent
> disks, but we don't have clear experimental support for those
> propositions. Also, both your data and mine indicate that too much
> parallelism can lead to major regressions.

I think for that we'd basically have to create two high bandwidth nodes
across the pond. My experience in the somewhat recent past is that I
could saturate multi-gbit cross-atlantic links without too much trouble,
at least once I changed sys.net.ipv4.tcp_congestion_control to something
appropriate for such setups (BBR is probably the thing to use here these
days).


> - Any work we do while trying to make backup super-fast should also
> lend itself to super-fast restore, possibly including parallel
> restore.

I'm not sure I see a super clear case for parallel restore in any of the
experiments done so far. The only case we know it's a clear win is when
there's independent filesystems for parts of the data.  There's an
obvious case for parallel decompression however.


> Compressed tarfiles don't permit random access to member files.

This is an issue for selective restores too, not just parallel
restore. I'm not sure how important a case that is, although it'd
certainly be useful if e.g. pg_rewind could read from compressed base
backups.

> Uncompressed tarfiles do, but software that works this way is not
> commonplace.

I am not 100% sure which part you're commenting on as not being
commonplace here. Supporting random access to data in tarfiles?

My understanding of that is that one still has to "skip" through the
entire archive, right? What not being compressed allows is to not have
to read the files in between. Given the size of our data files compared
to the metadata size, that's probably fine?
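
For what it's worth, the "skipping" is cheap: every tar member starts
with a 512-byte header whose size field tells you how far to seek to
reach the next header, so locating a file only touches the headers. A
rough sketch (assuming plain ustar format, no error handling):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
list_tar_members(int fd)
{
    char    hdr[512];

    while (read(fd, hdr, sizeof(hdr)) == sizeof(hdr) && hdr[0] != '\0')
    {
        /* the size field is octal text at offset 124 */
        long    size = strtol(hdr + 124, NULL, 8);

        /* the member name is the first 100 bytes of the header */
        printf("%.100s (%ld bytes)\n", hdr, size);

        /* member data is padded to a multiple of 512 bytes; skip it */
        lseek(fd, (size + 511) & ~511L, SEEK_CUR);
    }
}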


> The only mainstream archive format that seems to support random access
> seems to be zip. Adopting that wouldn't be crazy, but might limit our
> choice of compression options more than we'd like.

I'm not sure that's *really* an issue - there are compression format codes
in zip ([1] 4.4.5, also 4.3.14.3 & 4.5 for another approach), and
several tools seem to have used that to add additional compression
methods.


> A tar file of individually compressed files might be a plausible
> alternative, though there would probably be some hit to compression
> ratios for small files.

I'm not entirely sure using zip over
uncompressed-tar-over-compressed-files gains us all that much. AFAIU zip
compresses each file individually. So the advantage would be a more
efficient (less seeking) storage of archive metadata (i.e. which file is
where) and that the metadata could be compressed.


> Then again, if a single, highly-efficient process can handle a
> server-to-client backup, maybe the same is true for extracting a
> compressed tarfile...

Yea. I'd expect that to be the case, at least for the single filesystem
case. Depending on the way multiple tablespaces / filesystems are
handled, it could even be doable to handle that reasonably - but it'd
probably be harder.

Greetings,

Andres Freund

[1] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT