Re: Potential stack overflow in incremental base backup

From: Robert Haas
Subject: Re: Potential stack overflow in incremental base backup
Date:
Msg-id: CA+TgmobcAkMb8N1h9x9JQ8MZQWnvaeizqG5zVHvW8VaEp=qFaQ@mail.gmail.com
In reply to: Re: Potential stack overflow in incremental base backup  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Potential stack overflow in incremental base backup  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Pushed.  That fixes the stack problem.

Cool.

> Of course we still have this:
>
>     /*
>      * Since this array is relatively large, avoid putting it on the stack.
>      * But we don't need it at all if this is not an incremental backup.
>      */
>     if (ib != NULL)
>         relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
>
> To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> at making that dynamic.  It looks like we don't actually need to
> allocate that until we get to the GetFileBackupMethod() call, and at
> that point we have the file size.  If I understand correctly,
> statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> block number buffer there if required, right?

Yes.
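
Concretely, something along these lines (untested sketch; assuming
statbuf is still in scope at the point where we make the
GetFileBackupMethod() call):

    if (ib != NULL)
    {
        /*
         * Size the buffer from the actual file size rather than assuming
         * a full RELSEG_SIZE worth of blocks.  (Anything past MaxAllocSize
         * would need palloc_extended(..., MCXT_ALLOC_HUGE) instead.)
         */
        uint64      max_blocks = (statbuf.st_size + BLCKSZ - 1) / BLCKSZ;

        relative_block_numbers = palloc(sizeof(BlockNumber) * max_blocks);
    }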

> That way, you would
> only need shedloads of virtual memory if you used initdb
> --rel-segsize=shedloads and you actually have shedloads of data in a
> table/segment.  For example, with initdb --rel-segsize=1TB and a table
> that contains 1TB+ of data, that'd allocate 512MB.  It sounds
> borderline OK when put that way.  It sounds not OK with
> --rel-segsize=infinite and 32TB of data -> palloc(16GB).  I suppose
> one weasel-out would be to say we only support segments up to (say)
> 1TB, until eventually we figure out how to break this code's
> dependency on segments.  I guess we'll have to do that one day to
> support incremental backups of other smgr implementations that don't
> even have segments (segments are a private detail of md.c after all,
> not part of the smgr abstraction).

I think the thing to do would be to fetch and send the array of block
numbers in chunks. Instead of calling BlockRefTableEntryGetBlocks()
just once, you'd call it in a loop with a buffer that you know might
not be big enough and then iterate until you've got everything,
sending the partial block array to the client each time. Then you
repeat that loop a second time as you work your way through the giant
file, so that you always know which block needs to go into the
incremental file next. The only problem here is: exactly how do
you iterate? If BlockRefTableEntryGetBlocks() were guaranteed to
return blocks in order, then you could just keep track of the highest
block number you received from the previous call to that function and
pass the starting block number for the next call as that value plus
one. But as it is, that doesn't work. Whoops. Bad API design on my
part, but fixable.
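
To make that concrete, the shape I'm imagining is roughly the following
(untested; BlockRefTableEntryGetBlocksInRange() and
send_partial_block_array() are made-up names for a fixed-up,
range-bounded version of the API, not anything that exists today, and
entry/total_blocks are assumed to be handy):

    #define BLOCKS_PER_CHUNK    (128 * 1024)

    BlockNumber *chunk = palloc(sizeof(BlockNumber) * BLOCKS_PER_CHUNK);
    BlockNumber range_start = 0;

    while (range_start < total_blocks)
    {
        BlockNumber range_stop = Min(range_start + BLOCKS_PER_CHUNK,
                                     total_blocks);
        unsigned    nblocks;

        /*
         * Hypothetical range-bounded call: return only the modified
         * blocks whose numbers fall in [range_start, range_stop), so at
         * most BLOCKS_PER_CHUNK of them, and the buffer can't overflow.
         */
        nblocks = BlockRefTableEntryGetBlocksInRange(entry, range_start,
                                                     range_stop, chunk,
                                                     BLOCKS_PER_CHUNK);

        /* ship this partial block array to the client */
        send_partial_block_array(chunk, nblocks);

        range_start = range_stop;
    }
    pfree(chunk);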

The other problem here is on the pg_combinebackup side, where we have
a lot of the same issues. reconstruct.c wants to build a map of where
it should get each block before it actually does any I/O. If the whole
file can be terabytes in size, I guess we'd need some other approach
there, too. That's probably doable with a bunch of
rejiggering, but not trivial.
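
(For scale, at the default 8kB BLCKSZ a 32TB file is about 4.3 billion
blocks, so if the map costs, say, a pointer plus an offset per block,
that's on the order of 64GB just for the map -- it would really have to
work through the file in slices rather than planning the whole thing up
front.)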

I have to admit that I'm still not at all enthusiastic about 32TB
segments. I think we're going to find that incremental backup is only
the tip of the iceberg. I bet we'll find that there are all kinds of
weird annoyances that pop up with 10TB+ files. It could be outright
lack of support, like, oh, this archive format doesn't support files
larger than X size. But it could also be subtler things like, hey,
this directory copy routine updates the progress bar after each file,
so when some of the files are gigantic, you can no longer count on
having an accurate idea of how much of the directory has been copied.
I suspect that a lot of the code that turns out to have problems will
not be PostgreSQL code (although I expect us to find more problems in
our code, too) so we'll end up telling people "hey, it's not OUR fault
your stuff doesn't work, you just need to file a bug report with
${MAINTAINER_WHO_MAY_OR_MAY_NOT_CARE_ABOUT_30TB_FILES}". And we won't
catch any of the bugs in our own code or anyone else's in the
regression tests, buildfarm, or cfbot, because none of that stuff is
going to test with multi-terabyte files.

I do understand that a 1GB segment size is not that big in 2024, and
that filesystems with a 2GB limit are thought to have died out a long
time ago, and I'm not against using larger segments. I do think,
though, that increasing the segment size by 32768x in one shot is
likely to be overdoing it.

--
Robert Haas
EDB: http://www.enterprisedb.com


