Re: finding changed blocks using WAL scanning

Поиск
Список
Период
Сортировка
От Stephen Frost
Тема Re: finding changed blocks using WAL scanning
Дата
Msg-id 20190423154305.GR6197@tamriel.snowman.net
обсуждение исходный текст
Ответ на Re: finding changed blocks using WAL scanning  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: finding changed blocks using WAL scanning  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> >>>>> Oh.  Well, I already explained my algorithm for doing that upthread,
> >>>>> which I believe would be quite cheap.
> >>>>>
> >>>>> 1. When you generate the .modblock files, stick all the block
> >>>>> references into a buffer.  qsort().  Dedup.  Write out in sorted
> >>>>> order.
> >>>>
> >>>>Having all of the block references in a sorted order does seem like it
> >>>>would help, but would also make those potentially quite a bit larger
> >>>>than necessary (I had some thoughts about making them smaller elsewhere
> >>>>in this discussion).  That might be worth it though.  I suppose it might
> >>>>also be possible to line up the bitmaps suggested elsewhere to do
> >>>>essentially a BitmapOr of them to identify the blocks changed (while
> >>>>effectively de-duping at the same time).
> >>>
> >>>I don't see why this would make them bigger than necessary.  If you
> >>>sort by relfilenode/fork/blocknumber and dedup, then references to
> >>>nearby blocks will be adjacent in the file.  You can then decide what
> >>>format will represent that most efficiently on output.  Whether or not
> >>>a bitmap is better idea than a list of block numbers or something else
> >>>depends on what percentage of blocks are modified and how clustered
> >>>they are.
> >>
> >>Not sure I understand correctly - do you suggest to deduplicate and sort
> >>the data before writing them into the .modblock files? Because that the
> >>the sorting would make this information mostly useless for the recovery
> >>prefetching use case I mentioned elsewhere. For that to work we need
> >>information about both the LSN and block, in the LSN order.
> >
> >I'm not sure I follow- why does the prefetching need to get the blocks
> >in LSN order..?  Once the blocks that we know are going to change in the
> >next segment have been identified, we could prefetch them all and have
> >them ready for when replay gets to them.  I'm not sure that we
> >specifically need to have them pre-fetched in the same order that the
> >replay happens and it might even be better to fetch them in an order
> >that's as sequential as possible to get them in as quickly as possible.
>
> That means we'd have to prefetch all blocks for the whole WAL segment,
> which is pretty useless, IMO. A single INSERT (especially for indexes) is
> often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
> we don't want to prefetch all of that at once? And it's even worse for
> larger WAL segments, which are likely to get more common now that it's an
> initdb option.

Ah, yeah, I had been thinking about FPIs and blocks and figuring that
16MB wasn't that bad but you're certainly right that we could end up
touching a lot more blocks in a given WAL segment.

> I'm pretty sure the prefetching needs to be more like "prefetch the next
> 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
> WAL". That doesn't mean we can't do some additional optimization (like
> reordering them a bit), but there's a point where it gets detrimental
> because the kernel will just evict some of the prefetched blocks before we
> actually access them. And the larger amount of blocks you prefetch the
> more likely that is.

Since we're prefetching, maybe we shouldn't be just pulling the blocks
into the filesystem cache but instead into something we have more
control over..?

> >>So if we want to allow that use case to leverage this infrastructure, we
> >>need to write the .modfiles kinda "raw" and do this processing in some
> >>later step.
> >
> >If we really need the LSN info for the blocks, then we could still
> >de-dup, picking the 'first modified in this segment at LSN X', or keep
> >both first and last, or I suppose every LSN if we really want, and then
> >have that information included with the other information about the
> >block.  Downstream clients could then sort based on the LSN info if they
> >want to have a list of blocks in sorted-by-LSN-order.
>
> Possibly. I don't think keeping just the first block occurence is enough,
> particularly for large WAL segmnent sizes, but I agree we can reduce the
> stuff a bit (say, ignore references that are less than 1MB apart or so).
> We just can't remove all the LSN information entirely.

Yeah, it depends on how much memory we want to use for prefetching and
how large the WAL segments are.

> >>Now, maybe the incremental backup use case is so much more important the
> >>right thing to do is ignore this other use case, and I'm OK with that -
> >>as long as it's a conscious choice.
> >
> >I'd certainly like to have a way to prefetch, but I'm not entirely sure
> >that it makes sense to combine it with this, so while I sketched out
> >some ideas about how to do that above, I don't want it to come across as
> >being a strong endorsement of the overall idea.
> >
> >For pre-fetching purposes, for an async streaming replica, it seems like
> >the wal sender process could potentially just scan the WAL and have a
> >list of blocks ready to pass to the replica which are "this is what's
> >coming soon" or similar, rather than working with the modfiles at all.
> >Not sure if we'd always send that or if we wait for the replica to ask
> >for it.  Though for doing WAL replay from the archive, being able to ask
> >for the modfile first to do prefetching before replaying the WAL itself
> >could certainly be beneficial, so maybe it does make sense to have that
> >information there too..  still not sure we really need it in LSN order
> >or that we need to prefetch in LSN order though.
>
> Well, how exactly should the prefetching extract the block list is still
> an open question. But this seems to deal with pretty much the same stuff,
> so it might make sense to support the prefetching use case too. Or maybe
> not, I'm not sure - but IMHO we should try.

Yeah, makes sense to me to at least be thinking about it and seeing if
there's a way to make adding prefetching easier.

> I don't think it makes sense to do prefetching when the standby is fully
> caught up with the primary. It gets more important when it falls behind by
> a significant amount of WAL, at which point it's likely to fetch already
> closed WAL segments - either from primary or archive. So either it can get
> the block list from there, or it may extract the info on it's own (using
> some of the infrastructure used to build the mdblock files).

I don't know..  if we're able to keep up with the primary because we did
a bunch of prefetching and that made WAL replay fast enough that we
catch up, aren't we going to possibly fall right back behind again
pretty quickly if we don't prefetch for the blocks that are about to be
sent?  If we are caught up and there's no data to be sent then we don't
have much choice, but as soon as there's WAL data to be sent our way,
any more than just one block, then grabbing the list of blocks that we
know we're going to need during replay seems to make a lot of sense to
me.  In other words, I'd think we would just always do this and how much
we do would be dictated by both how far behind we are and how much
memory we want to allocate for this.

Thanks!

Stephen

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: finding changed blocks using WAL scanning
Следующее
От: David Fetter
Дата:
Сообщение: Re: doc: datatype-table and xfunc-c-type-table