Discussion: Potential stack overflow in incremental base backup

Potential stack overflow in incremental base backup

From:
Thomas Munro
Date:
Hi Robert,

I was rebasing my patch to convert RELSEG_SIZE into an initdb-time
setting, and thus runtime variable, and I noticed this stack variable
in src/backend/backup/basebackup_incremental.c:

GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
                    BlockNumber *relative_block_numbers,
                    unsigned *truncation_block_length)
{
    BlockNumber absolute_block_numbers[RELSEG_SIZE];

I'll have to move that sucker onto the heap (we banned C99 variable
length arrays and we don't use nonstandard alloca()), but I think the
coding in master is already a bit dodgy: that's 131072 *
sizeof(BlockNumber) = 512kB with default configure options, but
--with-segsize X could create a stack variable up to 16GB,
corresponding to segment size 32TB (meaning no segmentation at all).
That wouldn't work.  Shouldn't we move it to the heap?  See attached
draft patch.
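
In outline, the draft patch presumably amounts to something like this (a
minimal sketch only; the code around the array is assumed, not copied from
the attachment):

    BlockNumber *absolute_block_numbers;

    /* Sketch: move the scratch array from the stack to the heap. */
    absolute_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
    /* ... fill and consume absolute_block_numbers as before ... */
    pfree(absolute_block_numbers);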

Even on the heap, 16GB is too much to assume we can allocate during a
base backup.  I don't claim that's a real-world problem for
incremental backup right now in master, because I don't have any
evidence that anyone ever really uses --with-segsize (do they?), but
if we make it an initdb option it will be more popular and this will
become a problem.  Hmm.

Attachments

Re: Potential stack overflow in incremental base backup

From:
Alvaro Herrera
Date:
On 2024-Mar-06, Thomas Munro wrote:

> Even on the heap, 16GB is too much to assume we can allocate during a
> base backup.  I don't claim that's a real-world problem for
> incremental backup right now in master, because I don't have any
> evidence that anyone ever really uses --with-segsize (do they?), but
> if we make it an initdb option it will be more popular and this will
> become a problem.  Hmm.

Would it work to use a radix tree from the patchset at
https://postgr.es/m/CANWCAZb43ZNRK03bzftnVRAfHzNGzH26sjc0Ep-sj8+w20VzSg@mail.gmail.com
?

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"E pur si muove" (Galileo Galilei)



Re: Potential stack overflow in incremental base backup

From:
Robert Haas
Date:
On Wed, Mar 6, 2024 at 12:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll have to move that sucker onto the heap (we banned C99 variable
> length arrays and we don't use nonstandard alloca()), but I think the
> coding in master is already a bit dodgy: that's 131072 *
> sizeof(BlockNumber) = 512kB with default configure options, but
> --with-segsize X could create a stack variable up to 16GB,
> corresponding to segment size 32TB (meaning no segmentation at all).
> That wouldn't work.  Shouldn't we move it to the heap?  See attached
> draft patch.
>
> Even on the heap, 16GB is too much to assume we can allocate during a
> base backup.  I don't claim that's a real-world problem for
> incremental backup right now in master, because I don't have any
> evidence that anyone ever really uses --with-segsize (do they?), but
> if we make it an initdb option it will be more popular and this will
> become a problem.  Hmm.

I don't see a problem with moving it from the stack to the heap. I
don't believe I had any particular reason for wanting it on the stack
specifically.

I'm not sure what to do about really large segment sizes. As long as
the allocation here is < ~100MB, it's probably fine, both because (1)
100MB isn't an enormously large allocation these days and (2) if you
have a good reason to increase the segment size by a factor of 256 or
more, you're probably running on a big machine, and then you should
definitely have 100MB to burn.

However, a 32TB segment size is another thing altogether. We could
avoid transposing relative block numbers to absolute block numbers
whenever start_blkno is 0, but that doesn't do a whole lot when the
segment size is 8TB or 16TB rather than 32TB; plus, in such cases, the
relative block number array is also going to be gigantic. Maybe what
we need to do is figure out how to dynamically size the arrays in such
cases, so that you only make them as big as required for the file you
actually have, rather than as big as the file could theoretically be.
But I think that's really only necessary if we're actually going to
get rid of the idea of segmented relations altogether, which I don't
think is happening at least for v17, and maybe not ever.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Potential stack overflow in incremental base backup

From:
Robert Haas
Date:
On Wed, Mar 6, 2024 at 6:29 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Mar-06, Thomas Munro wrote:
> > Even on the heap, 16GB is too much to assume we can allocate during a
> > base backup.  I don't claim that's a real-world problem for
> > incremental backup right now in master, because I don't have any
> > evidence that anyone ever really uses --with-segsize (do they?), but
> > if we make it an initdb option it will be more popular and this will
> > become a problem.  Hmm.
>
> Would it work to use a radix tree from the patchset at
> https://postgr.es/m/CANWCAZb43ZNRK03bzftnVRAfHzNGzH26sjc0Ep-sj8+w20VzSg@mail.gmail.com
> ?

Probably not that much, because we actually send the array to the
client very soon after we construct it:

        push_to_sink(sink, &checksum_ctx, &header_bytes_done,
                     incremental_blocks,
                     sizeof(BlockNumber) * num_incremental_blocks);

This is hard to do without materializing the array somewhere, so I
don't think an alternate representation is the way to go in this
instance.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Potential stack overflow in incremental base backup

From:
Thomas Munro
Date:
On Fri, Mar 8, 2024 at 6:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But I think that's really only necessary if we're actually going to
> get rid of the idea of segmented relations altogether, which I don't
> think is happening at least for v17, and maybe not ever.

Yeah, I consider the feedback on ext4's size limitations to have
completely killed the idea of getting rid of segments for the
foreseeable future, at least in standard md.c (though who knows what
people will do with pluggable smgr?).  As for initdb --rel-segsize (CF
#4305) for md.c, I abandoned plans to do that for 17 because I
couldn't see what to do about this issue.  Incremental backup
effectively relies on smaller segments, by using them as
problem-dividing granularity for checksumming and memory usage.
That'll take some research...



Re: Potential stack overflow in incremental base backup

From:
Thomas Munro
Date:
On Fri, Mar 8, 2024 at 6:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> ... We could
> avoid transposing relative block numbers to absolute block numbers
> whenever start_blkno is 0,  ...

Could we just write the blocks directly into the output array, and
then transpose them directly in place if start_blkno > 0?  See
attached.  I may be missing something, but the only downside I can
think of is that the output array is still clobbered even if we decide
to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
requires a warning in the comment at the top.
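
In outline, the in-place transposition presumably boils down to something
like this (a sketch with assumed variable names, not the attached patch
itself):

    /*
     * Sketch: the absolute block numbers are fetched straight into the
     * caller's relative_block_numbers array, then shifted down to
     * segment-relative numbers in place when this is not the first segment.
     */
    if (start_blkno != 0)
    {
        for (unsigned i = 0; i < num_blocks; ++i)
            relative_block_numbers[i] -= start_blkno;
    }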

Attachments

Re: Potential stack overflow in incremental base backup

From:
Robert Haas
Date:
On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Could we just write the blocks directly into the output array, and
> then transpose them directly in place if start_blkno > 0?  See
> attached.  I may be missing something, but the only downside I can
> think of is that the output array is still clobbered even if we decide
> to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
> requires a warning in the comment at the top.

Yeah. This approach makes the name "relative_block_numbers" a bit
confusing, but not running out of memory is nice, too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Potential stack overflow in incremental base backup

From:
Thomas Munro
Date:
On Thu, Apr 11, 2024 at 12:11 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Could we just write the blocks directly into the output array, and
> > then transpose them directly in place if start_blkno > 0?  See
> > attached.  I may be missing something, but the only downside I can
> > think of is that the output array is still clobbered even if we decide
> > to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
> > requires a warning in the comment at the top.
>
> Yeah. This approach makes the name "relative_block_numbers" a bit
> confusing, but not running out of memory is nice, too.

Pushed.  That fixes the stack problem.

Of course we still have this:

    /*
     * Since this array is relatively large, avoid putting it on the stack.
     * But we don't need it at all if this is not an incremental backup.
     */
    if (ib != NULL)
        relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);

To rescue my initdb --rel-segsize project[1] for v18, I will have a go
at making that dynamic.  It looks like we don't actually need to
allocate that until we get to the GetFileBackupMethod() call, and at
that point we have the file size.  If I understand correctly,
statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
block number buffer there if required, right?  That way, you would
only need shedloads of virtual memory if you used initdb
--rel-segsize=shedloads and you actually have shedloads of data in a
table/segment.  For example, with initdb --rel-segsize=1TB and a table
that contains 1TB+ of data, that'd allocate 512MB.  It sounds
borderline OK when put that way.  It sounds not OK with
--rel-segsize=infinite and 32TB of data -> palloc(16GB).  I suppose
one weasel-out would be to say we only support segments up to (say)
1TB, until eventually we figure out how to break this code's
dependency on segments.  I guess we'll have to do that one day to
support incremental backups of other smgr implementations that don't
even have segments (segments are a private detail of md.c after all,
not part of the smgr abstraction).
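
As a rough illustration of that (a sketch only; max_blocks is an invented
name and statbuf is assumed to be in scope at the call site), the
allocation could be sized from the file we actually have:

    if (ib != NULL)
    {
        /*
         * Sketch: size the buffer from the observed file size rather than
         * from the theoretical maximum RELSEG_SIZE.
         */
        size_t      max_blocks = statbuf.st_size / BLCKSZ;

        relative_block_numbers = palloc(sizeof(BlockNumber) * max_blocks);
    }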

[1] https://commitfest.postgresql.org/48/4305/



Re: Potential stack overflow in incremental base backup

From:
Robert Haas
Date:
On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Pushed.  That fixes the stack problem.

Cool.

> Of course we still have this:
>
>     /*
>      * Since this array is relatively large, avoid putting it on the stack.
>      * But we don't need it at all if this is not an incremental backup.
>      */
>     if (ib != NULL)
>         relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
>
> To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> at making that dynamic.  It looks like we don't actually need to
> allocate that until we get to the GetFileBackupMethod() call, and at
> that point we have the file size.  If I understand correctly,
> statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> block number buffer there if required, right?

Yes.

> That way, you would
> only need shedloads of virtual memory if you used initdb
> --rel-segsize=shedloads and you actually have shedloads of data in a
> table/segment.  For example, with initdb --rel-segsize=1TB and a table
> that contains 1TB+ of data, that'd allocate 512MB.  It sounds
> borderline OK when put that way.  It sounds not OK with
> --rel-segsize=infinite and 32TB of data -> palloc(16GB).  I suppose
> one weasel-out would be to say we only support segments up to (say)
> 1TB, until eventually we figure out how to break this code's
> dependency on segments.  I guess we'll have to do that one day to
> support incremental backups of other smgr implementations that don't
> even have segments (segments are a private detail of md.c after all,
> not part of the smgr abstraction).

I think the thing to do would be to fetch and send the array of block
numbers in chunks. Instead of calling BlockRefTableEntryGetBlocks()
just once, you'd call it in a loop with a buffer that you know might
not be big enough and then iterate until you've got everything,
sending the partial block array to the client each time. Then you
repeat that loop a second time as you work your way through the giant
file so that you always know what the next block you need to include
in the incremental file is. The only problem here is: exactly how do
you iterate? If BlockRefTableEntryGetBlocks() were guaranteed to
return blocks in order, then you could just keep track of the highest
block number you received from the previous call to that function and
pass the starting block number for the next call as that value plus
one. But as it is, that doesn't work. Whoops. Bad API design on my
part, but fixable.
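
In outline, and assuming the ordering guarantee that, as noted above,
BlockRefTableEntryGetBlocks() does not currently provide, the chunked loop
might look roughly like this (the chunk size, the send_partial_block_array()
helper and the surrounding variables are all invented for the sketch):

    #define BLOCK_CHUNK_SIZE 4096       /* arbitrary size for the sketch */

    BlockNumber chunk[BLOCK_CHUNK_SIZE];
    BlockNumber next_blkno = start_blkno;
    unsigned    nresults;

    /* Sketch: fetch and stream the block numbers in bounded chunks. */
    do
    {
        nresults = BlockRefTableEntryGetBlocks(entry, next_blkno, stop_blkno,
                                               chunk, BLOCK_CHUNK_SIZE);
        if (nresults > 0)
        {
            send_partial_block_array(chunk, nresults);  /* hypothetical */
            /* Only valid if blocks come back in ascending order. */
            next_blkno = chunk[nresults - 1] + 1;
        }
    } while (nresults == BLOCK_CHUNK_SIZE);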

The other problem here is on the pg_combinebackup side, where we have
a lot of the same issues. reconstruct.c wants to build a map of where
it should get each block before it actually does any I/O. If the whole
file is potentially terabytes in size, I guess that would need to use
some other approach there, too. That's probably doable with a bunch of
rejiggering, but not trivial.

I have to admit that I'm still not at all enthusiastic about 32TB
segments. I think we're going to find that incremental backup is only
the tip of the iceberg. I bet we'll find that there are all kinds of
weird annoyances that pop up with 10TB+ files. It could be outright
lack of support, like, oh, this archive format doesn't support files
larger than X size. But it could also be subtler things like, hey,
this directory copy routine updates the progress bar after each file,
so when some of the files are gigantic, you can no longer count on
having an accurate idea of how much of the directory has been copied.
I suspect that a lot of the code that turns out to have problems will
not be PostgreSQL code (although I expect us to find more problems in
our code, too) so we'll end up telling people "hey, it's not OUR fault
your stuff doesn't work, you just need to file a bug report with
${MAINTAINER_WHO_MAY_OR_MAY_NOT_CARE_ABOUT_30TB_FILES}". And we won't
catch any of the bugs in our own code or anyone else's in the
regression tests, buildfarm, or cfbot, because none of that stuff is
going to test with multi-terabyte files.

I do understand that a 1GB segment size is not that big in 2024, and
that filesystems with a 2GB limit are thought to have died out a long
time ago, and I'm not against using larger segments. I do think,
though, that increasing the segment size by 32768x in one shot is
likely to be overdoing it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Potential stack overflow in incremental base backup

From:
Thomas Munro
Date:
On Tue, Apr 16, 2024 at 4:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> > at making that dynamic.  It looks like we don't actually need to
> > allocate that until we get to the GetFileBackupMethod() call, and at
> > that point we have the file size.  If I understand correctly,
> > statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> > block number buffer there if required, right?
>
> Yes.

Here is a first attempt at that.  Better factoring welcome.  New
observations made along the way: the current coding can exceed
MaxAllocSize and error out, or overflow 32 bit size_t and allocate
nonsense.  Do you think it'd be better to raise an error explaining
that, or silently fall back to full backup (tried that way in the
attached), or that + log messages?  Another option would be to use
huge allocations, so we only have to deal with that sort of question
for 32 bit systems (i.e. effectively hypothetical/non-relevant
scenarios), but I don't love that idea.
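
For what it's worth, the overflow-safe check could be as simple as
something like this (a sketch, not the attached patch; max_blocks is an
assumed name, and the fall-back-to-full-backup behaviour mirrors what the
attachment is described as doing):

    /*
     * Sketch: do the size arithmetic in 64 bits so it cannot wrap on a
     * 32-bit size_t, and back up the file in full if the block-number
     * array would exceed MaxAllocSize.
     */
    uint64      request = (uint64) max_blocks * sizeof(BlockNumber);

    if (request > MaxAllocSize)
        return BACK_UP_FILE_FULLY;
    relative_block_numbers = palloc((Size) request);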

> ...
> I do understand that a 1GB segment size is not that big in 2024, and
> that filesystems with a 2GB limit are thought to have died out a long
> time ago, and I'm not against using larger segments. I do think,
> though, that increasing the segment size by 32768x in one shot is
> likely to be overdoing it.

My intuition is that the primary interesting lines to cross are at 2GB
and 4GB due to data type stuff.  I defend against the most basic
problem in my proposal: I don't let you exceed your off_t type, but
that doesn't mean we don't have off_t/ssize_t/size_t/long snafus
lurking in our code that could bite a 32 bit system with large files.
If you make it past those magic numbers and your tools are all happy,
I think you should be home free until you hit file system limits,
which are effectively unhittable on most systems except ext4's already
bemoaned 16TB limit AFAIK.  But all the same, I'm contemplating
limiting the range to 1TB in the first version, not because of general
fear of unknown unknowns, but just because it means we don't need to
use "huge" allocations for this known place, maybe until we can
address that.

Attachments