Discussion: Add LSN <-> time conversion functionality
Hi,

Elsewhere [1] I required a way to estimate the time corresponding to a particular LSN in the past. I devised the attached LSNTimeline, a data structure mapping LSNs <-> timestamps with decreasing precision for older time, LSN pairs. This can be used to locate and translate a particular time to LSN or vice versa using linear interpolation.

I've added an instance of the LSNTimeline to PgStat_WalStats and insert new values to it in background writer's main loop. This patch set also introduces some new pageinspect functions exposing LSN <-> time translations.

Outside of being useful to users wondering about the last modification time of a particular block in a relation, the LSNTimeline can be put to use in other Postgres sub-systems to govern behavior based on resource consumption -- using the LSN consumption rate as a proxy. As mentioned in [1], the LSNTimeline is a prerequisite for my implementation of a new freeze heuristic which seeks to freeze only pages which will remain unmodified for a certain amount of wall clock time. But one can imagine other uses for such translation capabilities.

The pageinspect additions need a bit more work. I didn't bump the pageinspect version (didn't add the new functions to a new pageinspect version file). I also didn't exercise the new pageinspect functions in a test. I was unsure how to write a test which would be guaranteed not to flake. Because the background writer updates the timeline, it seemed a remote possibility that the time or LSN returned by the functions would be 0 and, as such, I'm not sure even a test that SELECT time/lsn > 0 would always pass.

I also noticed the pageinspect functions don't have XML id attributes for link discoverability. I planned to add that in a separate commit.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_b3tpbdRPUPh1Q5h35gXhY%3DspH2ssNsEsJ9sDfw6%3DPEAg%40mail.gmail.com
Attachments
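The linear interpolation between recorded (time, LSN) points described above can be sketched roughly as follows. This is an illustrative Python model with invented names (the actual patch implements this in C inside PostgreSQL), shown only to make the translation mechanics concrete:

```python
# Illustrative sketch of translating an LSN to a timestamp by linear
# interpolation between the two nearest recorded (time, lsn) points.
# Names (LSNTime, estimate_time) are made up, not the patch's API.
from collections import namedtuple

LSNTime = namedtuple("LSNTime", ["time", "lsn"])  # (unix seconds, WAL byte position)

def estimate_time(older: LSNTime, newer: LSNTime, target_lsn: int) -> float:
    """Interpolate the timestamp for target_lsn between two recorded points."""
    if newer.lsn == older.lsn:          # no WAL advance between the two points
        return newer.time
    fraction = (target_lsn - older.lsn) / (newer.lsn - older.lsn)
    return older.time + fraction * (newer.time - older.time)

older = LSNTime(time=1000.0, lsn=0)
newer = LSNTime(time=2000.0, lsn=1_000_000)
# Halfway through the WAL range maps to halfway through the time range.
print(estimate_time(older, newer, 500_000))  # 1500.0
```

The reverse direction (time -> LSN) is the same computation with the roles of the two fields swapped.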
On Wed, Dec 27, 2023 at 5:16 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Elsewhere [1] I required a way to estimate the time corresponding to a
> particular LSN in the past. I devised the attached LSNTimeline, a data
> structure mapping LSNs <-> timestamps with decreasing precision for
> older time, LSN pairs. This can be used to locate and translate a
> particular time to LSN or vice versa using linear interpolation.

Attached is a new version which fixes one overflow danger I noticed in the original patch set.

I have also been doing some thinking about the LSNTimeline data structure. Its array elements are combined before all elements have been used. This sacrifices precision earlier than required. I tried some alternative structures that would use the whole array. There are a lot of options, though. Currently each element fits twice as many members as the preceding element. To use the whole array, we'd have to change the behavior from filling each element to its max capacity to something that filled elements only partially. I'm not sure what the best distribution would be.

> I've added an instance of the LSNTimeline to PgStat_WalStats and insert
> new values to it in background writer's main loop. This patch set also
> introduces some new pageinspect functions exposing LSN <-> time
> translations.

I was thinking that maybe it is silly to have the functions allowing for translation between LSN and time in the pageinspect extension -- since they are not specifically related to pages (pages are just an object that has an accessible LSN). I was thinking perhaps we add them as system information functions. However, the closest related functions I can think of are those to get the current LSN (like pg_current_wal_lsn()). And those are listed as system administration functions under backup control [1]. I don't think the LSN <-> time functionality fits under backup control. If I did put them in one of the system information function sections [2], which one would work best?

- Melanie

[1] https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
[2] https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO
Attachments
Hi,

I took a look at this today, to try to understand the purpose and how it works. Let me share some initial thoughts and questions I have. Some of this may be wrong/missing the point, so apologies for that.

The goal seems worthwhile in general - the way I understand it, the patch aims to provide tracking of WAL "velocity", i.e. how much WAL was generated over time. Which we now don't have, as we only maintain simple cumulative stats/counters. And then uses it to estimate the timestamp for a given LSN, and vice versa, because that's what the pruning patch needs.

When I first read this, I immediately started wondering if this might use the commit timestamp stuff we already have. Because for each commit we already store the LSN and commit timestamp, right? But I'm not sure that would be a good match - the commit_ts serves a very special purpose of mapping XID => (LSN, timestamp); I don't see how to make it work for (LSN => timestamp) and (timestamp => LSN) very easily.

As for the inner workings of the patch, my understanding is this:

- "LSNTimeline" consists of "LSNTime" entries representing (LSN, ts) points, but those points are really "buckets" that grow larger and larger for older periods of time.

- The entries are being added from bgwriter, i.e. on each loop we add the current (LSN, timestamp) into the timeline.

- We then estimate LSN/timestamp using the data stored in LSNTimeline (either LSN => timestamp, or the opposite direction).

Some comments in arbitrary order:

- AFAIK each entry represents an interval of time, and the next (older) interval is twice as long, right? So the first interval is 1 second, then 2 seconds, 4 seconds, 8 seconds, ...

- But I don't understand how the LSNTimeline entries are "aging" and get less accurate, while the "current" bucket is short. lsntime_insert() seems to simply move to the next entry, but doesn't that mean we insert the entries into larger and larger buckets?

- The comments never really spell out what amount of time the entries cover / how granular it is. My understanding is it's simply measured in number of entries added, which is assumed to be constant and driven by bgwriter_delay, right? Which is 200ms by default. Which seems fine, but isn't the hibernation (HIBERNATE_FACTOR) going to mess with it? Is there some case where bgwriter would just loop without sleeping, filling the timeline much faster? (I can't think of any, but ...)

- The LSNTimeline comment claims an array of size 64 is large enough to not need to care about filling it, but maybe it should briefly explain why we can never fill it (I guess 2^64 is just too many).

- I don't quite understand why 0005 adds the functions to pageinspect. This has nothing to do with pages, right?

- Not sure why we need 0001. Just so that the "estimate" functions in 0002 have a convenient "start" point? Surely we could look at the current LSNTimeline data and use the oldest value, or (if there's no data) use the current timestamp/LSN?

- I wonder what happens if we lose the data - we know that if people reset statistics for whatever reason (or just lose them because of a crash, or because they're on a replica), bad things happen to autovacuum. What's the (expected) impact on pruning?

- What about a SRF function that outputs the whole LSNTimeline? Would be useful for debugging / development, I think. (Just a suggestion).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks so much for reviewing!

On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> When I first read this, I immediately started wondering if this might
> use the commit timestamp stuff we already have. Because for each commit
> we already store the LSN and commit timestamp, right? But I'm not sure
> that would be a good match - the commit_ts serves a very special purpose
> of mapping XID => (LSN, timestamp); I don't see how to make it work for
> (LSN => timestamp) and (timestamp => LSN) very easily.

I took a look at the code in commit_ts.c, and I really don't see a way of reusing any of this commit <-> timestamp infrastructure for timestamp <-> LSN mappings.

> As for the inner workings of the patch, my understanding is this:
>
> - "LSNTimeline" consists of "LSNTime" entries representing (LSN, ts)
> points, but those points are really "buckets" that grow larger and
> larger for older periods of time.

Yes, they are buckets in the sense that they represent multiple values, but each contains a single LSNTime value which is the minimum of all the LSNTimes we "merged" into that single array element. In order to represent a range of time, you need to use two array elements. The linear interpolation from time <-> LSN is all done with two elements.

> - AFAIK each entry represents an interval of time, and the next (older)
> interval is twice as long, right? So the first interval is 1 second,
> then 2 seconds, 4 seconds, 8 seconds, ...
>
> - But I don't understand how the LSNTimeline entries are "aging" and get
> less accurate, while the "current" bucket is short. lsntime_insert()
> seems to simply move to the next entry, but doesn't that mean we insert
> the entries into larger and larger buckets?

Because the earlier array elements can represent fewer logical members than later ones, and because elements are merged into the next element when space runs out, later array elements will contain older data and more of it, so those "ranges" will be larger.

But, after thinking about it and also reading your feedback, I realized my algorithm was silly because it starts merging logical members before it has even used the whole array.

The attached v3 has a new algorithm. Now, LSNTimes are added from the end of the array forward until all array elements have at least one logical member (array length == volume). Once array length == volume, new LSNTimes will result in merging logical members in existing elements. We want to merge older members because those can be less precise. So, the number of logical members per array element will always monotonically increase starting from the beginning of the array (which contains the most recent data) and going to the end. We want to use all the available space in the array. That means that each LSNTime insertion will always result in a single merge. We want the timeline to be inclusive of the oldest data, so merging means taking the smaller value of two LSNTime values.

I had to pick a rule for choosing which elements to merge. So, I chose the merge target as the oldest element whose logical membership is < 2x its predecessor. I merge the merge target's predecessor into the merge target. Then I move all of the intervening elements down 1. Then I insert the new LSNTime at index 0.

> - The comments never really spell out what amount of time the entries cover
> / how granular it is. My understanding is it's simply measured in number
> of entries added, which is assumed to be constant and driven by
> bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
> isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?
>
> Is there some case where bgwriter would just loop without sleeping,
> filling the timeline much faster? (I can't think of any, but ...)

bgwriter will wake up when there are buffers to flush, which is likely correlated with there being new LSNs. So, actually, it seems like it might work well to rely on only filling the timeline when there are things for bgwriter to do.

> - The LSNTimeline comment claims an array of size 64 is large enough to
> not need to care about filling it, but maybe it should briefly explain
> why we can never fill it (I guess 2^64 is just too many).

The new structure fits a different number of members. I have yet to calculate that number, but it should be explained in the comments once I do.

For example, if we made an LSNTimeline with volume 4, once every element had one LSNTime and we needed to start merging, the following is how many logical members each element would have after each of four merges:

1111
1112
1122
1114
1124

So, if we store the number of members as an unsigned 64-bit int and we have an LSNTimeline with volume 64, what is the maximum number of members we can store if we hold all of the invariants described in my algorithm above (we only merge when required, every element holds < 2x the number of logical members as its predecessor, we do exactly one merge every insertion [when required], membership must monotonically increase [choose the oldest element meeting the criteria when deciding what to merge])?

> - I don't quite understand why 0005 adds the functions to pageinspect.
> This has nothing to do with pages, right?

You're right. I just couldn't think of a good place to put the functions. In version 3, I just put the SQL functions in pgstat_wal.c and made them generally available (i.e. not in a contrib module). I haven't added docs back yet. But perhaps a section near the docs describing pg_xact_commit_timestamp() [1]? I wasn't sure if I should put the SQL function source code in pgstatfuncs.c -- I kind of prefer it in pgstat_wal.c but there are no other SQL functions there.

> - Not sure why we need 0001. Just so that the "estimate" functions in
> 0002 have a convenient "start" point? Surely we could look at the
> current LSNTimeline data and use the oldest value, or (if there's no
> data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline you'll get an answer that could be very off if you just return the current timestamp or LSN. I guess that is okay?

> - I wonder what happens if we lose the data - we know that if people
> reset statistics for whatever reason (or just lose them because of a
> crash, or because they're on a replica), bad things happen to
> autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we could return very inaccurate numbers for some time/LSN values if we crash. I don't actually know what we could do about that. When I use the LSNTimeline for the freeze heuristic it is less of an issue because the freeze heuristic has a fallback strategy when there aren't enough stats to do its calculations. But other users of this LSNTimeline will simply get highly inaccurate results (I think?). Is there anything we could do about this? It seems bad.

Andres had brought up something at some point about what happens if the database is simply turned off for a while and then turned back on. Even if you cleanly shut down, will there be "gaps" in the timeline? I think that could be okay, but it is something to think about.

> - What about a SRF function that outputs the whole LSNTimeline? Would be
> useful for debugging / development, I think. (Just a suggestion).

Good idea! I've added this. Though, maybe there was a simpler way to implement it than I did.

Just a note: all of my comments could use a lot of work, but I want to get consensus on the algorithm before I make sure and write about it in a perfect way.

- Melanie

[1] https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-COMMIT-TIMESTAMP
Attachments
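The v3 merge rule described in the message above can be modeled in a few lines. This is an illustrative Python sketch (the patch itself is C inside pgstat, and these names are invented); it reproduces the volume-4 membership example from the mail:

```python
# Model of the v3 insertion/merge algorithm: index 0 holds the newest
# entry; each element tracks how many logical members were merged into it.
# The merge target is the oldest element whose membership is < 2x its
# (newer) predecessor's; the predecessor is absorbed into it.

def lsntime_insert(members, new=1):
    """Insert one entry into a full timeline, merging once to make room.

    `members` is the per-element logical membership count, newest first.
    Returns the updated list.
    """
    n = len(members)
    target = None
    for i in range(n - 1, 0, -1):          # scan oldest -> newest
        if members[i] < 2 * members[i - 1]:
            target = i
            break
    assert target is not None              # invariant: a target always exists
    merged = members[target] + members[target - 1]   # absorb the predecessor
    # Shift the intervening newer elements down one slot, insert new at 0.
    return [new] + members[:target - 1] + [merged] + members[target + 1:]

# Reproduces the volume-4 sequence from the mail: 1112, 1122, 1114, 1124.
state = [1, 1, 1, 1]
for _ in range(4):
    state = lsntime_insert(state)
    print(state)
```

Note how membership stays monotonically non-decreasing from index 0 to the end after every insertion, which is the invariant the mail describes.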
> On 22 Feb 2024, at 03:45, Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> - Not sure why we need 0001. Just so that the "estimate" functions in
>> 0002 have a convenient "start" point? Surely we could look at the
>> current LSNTimeline data and use the oldest value, or (if there's no
>> data) use the current timestamp/LSN?
>
> When there are 0 or 1 entries in the timeline you'll get an answer
> that could be very off if you just return the current timestamp or
> LSN. I guess that is okay?

I don't think that's a huge problem at such a young "lsn-age", but I might be missing something.

>> - I wonder what happens if we lose the data - we know that if people
>> reset statistics for whatever reason (or just lose them because of a
>> crash, or because they're on a replica), bad things happen to
>> autovacuum. What's the (expected) impact on pruning?
>
> This is an important question. Because stats aren't crashsafe, we
> could return very inaccurate numbers for some time/LSN values if we
> crash. I don't actually know what we could do about that. When I use
> the LSNTimeline for the freeze heuristic it is less of an issue
> because the freeze heuristic has a fallback strategy when there aren't
> enough stats to do its calculations. But other users of this
> LSNTimeline will simply get highly inaccurate results (I think?). Is
> there anything we could do about this? It seems bad.

A complication with this over stats is that we can't recompute this in case of a crash/corruption issue. The simple solution is to consider this unlogged data and start fresh at every unclean shutdown, but I have a feeling that won't be good enough for basing heuristics on?

> Andres had brought up something at some point about, what if the
> database is simply turned off for awhile and then turned back on. Even
> if you cleanly shut down, will there be "gaps" in the timeline? I
> think that could be okay, but it is something to think about.

The gaps would represent reality, so there is nothing wrong per se with gaps, but if they inflate the interval of a bucket, which in turn impacts the precision of the results, then that can be a problem.

> Just a note, all of my comments could use a lot of work, but I want to
> get consensus on the algorithm before I make sure and write about it
> in a perfect way.

I'm not sure "a lot of work" is accurate, they seem pretty much there to me, but I think that an illustration of running through the algorithm in an ascii-art array would be helpful.

Reading through this I think such a function has merits, not only for your usecase but other heuristic based work and quite possibly systems debugging.

While the bucketing algorithm is a clever algorithm for degrading precision for older entries without discarding them, I do fear that we'll risk ending up with answers like "somewhere between in the past and even further in the past". I've been playing around with various compression algorithms for packing the buckets such that we can retain precision for longer. Since you were aiming to work on other patches leading up to the freeze, let's pick this up again once things calm down.

When compiling I got this warning for lsntime_merge_target:

pgstat_wal.c:242:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.

The issue seems to be that the can't-really-happen path is protected with an Assertion, which is a no-op for production builds. I think we should handle the error rather than rely on testing catching it (since if it does happen even though it can't, it's going to be when it's for sure not tested). Returning an invalid array subscript like -1 and testing for it in lsntime_insert, with an elog(WARNING..), seems enough.

- last_snapshot_lsn <= GetLastImportantRecPtr())
+ last_snapshot_lsn <= current_lsn)

I think we should delay extracting the LSN with GetLastImportantRecPtr until we know that we need it, to avoid taking locks in this codepath unless needed.

I've attached a diff with the above suggestions which applies on top of your patchset.

--
Daniel Gustafsson
Attachments
On 2/22/24 03:45, Melanie Plageman wrote:
> Thanks so much for reviewing!
>
> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> When I first read this, I immediately started wondering if this might
>> use the commit timestamp stuff we already have. Because for each commit
>> we already store the LSN and commit timestamp, right? But I'm not sure
>> that would be a good match - the commit_ts serves a very special purpose
>> of mapping XID => (LSN, timestamp); I don't see how to make it work for
>> (LSN => timestamp) and (timestamp => LSN) very easily.
>
> I took a look at the code in commit_ts.c, and I really don't see a way
> of reusing any of this commit <-> timestamp infrastructure for
> timestamp <-> LSN mappings.
>
>> As for the inner workings of the patch, my understanding is this:
>>
>> - "LSNTimeline" consists of "LSNTime" entries representing (LSN, ts)
>> points, but those points are really "buckets" that grow larger and
>> larger for older periods of time.
>
> Yes, they are buckets in the sense that they represent multiple values,
> but each contains a single LSNTime value which is the minimum of all
> the LSNTimes we "merged" into that single array element. In order to
> represent a range of time, you need to use two array elements. The
> linear interpolation from time <-> LSN is all done with two elements.
>
>> - AFAIK each entry represents an interval of time, and the next (older)
>> interval is twice as long, right? So the first interval is 1 second,
>> then 2 seconds, 4 seconds, 8 seconds, ...
>>
>> - But I don't understand how the LSNTimeline entries are "aging" and get
>> less accurate, while the "current" bucket is short. lsntime_insert()
>> seems to simply move to the next entry, but doesn't that mean we insert
>> the entries into larger and larger buckets?
>
> Because the earlier array elements can represent fewer logical members
> than later ones, and because elements are merged into the next element
> when space runs out, later array elements will contain older data and
> more of it, so those "ranges" will be larger. But, after thinking
> about it and also reading your feedback, I realized my algorithm was
> silly because it starts merging logical members before it has even
> used the whole array.
>
> The attached v3 has a new algorithm. Now, LSNTimes are added from the
> end of the array forward until all array elements have at least one
> logical member (array length == volume). Once array length == volume,
> new LSNTimes will result in merging logical members in existing
> elements. We want to merge older members because those can be less
> precise. So, the number of logical members per array element will
> always monotonically increase starting from the beginning of the array
> (which contains the most recent data) and going to the end. We want to
> use all the available space in the array. That means that each LSNTime
> insertion will always result in a single merge. We want the timeline
> to be inclusive of the oldest data, so merging means taking the
> smaller value of two LSNTime values. I had to pick a rule for choosing
> which elements to merge. So, I chose the merge target as the oldest
> element whose logical membership is < 2x its predecessor. I merge the
> merge target's predecessor into the merge target. Then I move all of
> the intervening elements down 1. Then I insert the new LSNTime at
> index 0.

I can't help but think about t-digest [1], which also merges data into variable-sized buckets (called centroids, which are pairs of values, just like LSNTime). But the merging is driven by something called a "scale function", which I found to be a pretty nice approach to this, and it yields some guarantees regarding accuracy. I wonder if we could do something similar here ...

The t-digest is a way to approximate quantiles, and the default scale function is optimized for best accuracy on the extremes (close to 0.0 and 1.0), but it's possible to use scale functions that optimize only for accuracy close to 1.0.

This may be misguided, but I see a similarity between quantiles and what LSNTimeline does - timestamps are ordered, and quantiles close to 0.0 are "old timestamps" while quantiles close to 1.0 are "now". And t-digest also defines a pretty efficient algorithm to merge data in a way that gradually combines older "buckets" into larger ones.

>> - The comments never really spell out what amount of time the entries cover
>> / how granular it is. My understanding is it's simply measured in number
>> of entries added, which is assumed to be constant and driven by
>> bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
>> isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?
>>
>> Is there some case where bgwriter would just loop without sleeping,
>> filling the timeline much faster? (I can't think of any, but ...)
>
> bgwriter will wake up when there are buffers to flush, which is likely
> correlated with there being new LSNs. So, actually it seems like it
> might work well to rely on only filling the timeline when there are
> things for bgwriter to do.
>
>> - The LSNTimeline comment claims an array of size 64 is large enough to
>> not need to care about filling it, but maybe it should briefly explain
>> why we can never fill it (I guess 2^64 is just too many).
>
> The new structure fits a different number of members. I have yet to
> calculate that number, but it should be explained in the comments once
> I do.
>
> For example, if we made an LSNTimeline with volume 4, once every
> element had one LSNTime and we needed to start merging, the following
> is how many logical members each element would have after each of four
> merges:
> 1111
> 1112
> 1122
> 1114
> 1124
> So, if we store the number of members as an unsigned 64-bit int and we
> have an LSNTimeline with volume 64, what is the maximum number of
> members we can store if we hold all of the invariants described in my
> algorithm above (we only merge when required, every element holds < 2x
> the number of logical members as its predecessor, we do exactly one
> merge every insertion [when required], membership must monotonically
> increase [choose the oldest element meeting the criteria when deciding
> what to merge])?

I guess that should be enough for (2^64 - 1) logical members, because it's a sequence 1, 2, 4, 8, ..., 2^63. Seems enough.

But now that I think about it, does it make sense to do the merging based on the number of logical members? Shouldn't this really be driven by the "length" of the time interval the member covers?

>> - I don't quite understand why 0005 adds the functions to pageinspect.
>> This has nothing to do with pages, right?
>
> You're right. I just couldn't think of a good place to put the
> functions. In version 3, I just put the SQL functions in pgstat_wal.c
> and made them generally available (i.e. not in a contrib module). I
> haven't added docs back yet. But perhaps a section near the docs
> describing pg_xact_commit_timestamp() [1]? I wasn't sure if I should
> put the SQL function source code in pgstatfuncs.c -- I kind of prefer
> it in pgstat_wal.c but there are no other SQL functions there.

OK, pgstat_wal seems like a good place.

>> - Not sure why we need 0001. Just so that the "estimate" functions in
>> 0002 have a convenient "start" point? Surely we could look at the
>> current LSNTimeline data and use the oldest value, or (if there's no
>> data) use the current timestamp/LSN?
>
> When there are 0 or 1 entries in the timeline you'll get an answer
> that could be very off if you just return the current timestamp or
> LSN. I guess that is okay?
>
>> - I wonder what happens if we lose the data - we know that if people
>> reset statistics for whatever reason (or just lose them because of a
>> crash, or because they're on a replica), bad things happen to
>> autovacuum. What's the (expected) impact on pruning?
>
> This is an important question. Because stats aren't crashsafe, we
> could return very inaccurate numbers for some time/LSN values if we
> crash. I don't actually know what we could do about that. When I use
> the LSNTimeline for the freeze heuristic it is less of an issue
> because the freeze heuristic has a fallback strategy when there aren't
> enough stats to do its calculations. But other users of this
> LSNTimeline will simply get highly inaccurate results (I think?). Is
> there anything we could do about this? It seems bad.
>
> Andres had brought up something at some point about, what if the
> database is simply turned off for awhile and then turned back on. Even
> if you cleanly shut down, will there be "gaps" in the timeline? I
> think that could be okay, but it is something to think about.
>
>> - What about a SRF function that outputs the whole LSNTimeline? Would be
>> useful for debugging / development, I think. (Just a suggestion).
>
> Good idea! I've added this. Though, maybe there was a simpler way to
> implement it than I did.

Thanks. I'll take a look.

> Just a note, all of my comments could use a lot of work, but I want to
> get consensus on the algorithm before I make sure and write about it
> in a perfect way.

Makes sense, as long as the comments are sufficiently clear. It's hard to reach consensus on something not explained clearly enough.

regards

[1] https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
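The capacity estimate in the reply above can be checked directly: under the "< 2x predecessor" invariant, the 64 element memberships saturate at 1, 2, 4, ..., 2^63, and those sum to 2^64 - 1. A quick sanity check (not patch code):

```python
# Sanity check of the capacity argument: with memberships saturating at
# 1, 2, 4, ..., 2^63 across 64 array elements, the total number of
# logical members representable is 2^64 - 1, i.e. the full range of an
# unsigned 64-bit counter.
capacity = sum(2**k for k in range(64))
print(capacity)                 # 18446744073709551615
print(capacity == 2**64 - 1)    # True
```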
On 3/18/24 15:02, Daniel Gustafsson wrote: >> On 22 Feb 2024, at 03:45, Melanie Plageman <melanieplageman@gmail.com> wrote: >> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: > >>> - Not sure why we need 0001. Just so that the "estimate" functions in >>> 0002 have a convenient "start" point? Surely we could look at the >>> current LSNTimeline data and use the oldest value, or (if there's no >>> data) use the current timestamp/LSN? >> >> When there are 0 or 1 entries in the timeline you'll get an answer >> that could be very off if you just return the current timestamp or >> LSN. I guess that is okay? > > I don't think that's a huge problem at such a young "lsn-age", but I might be > missing something. > >>> - I wonder what happens if we lose the data - we know that if people >>> reset statistics for whatever reason (or just lose them because of a >>> crash, or because they're on a replica), bad things happen to >>> autovacuum. What's the (expected) impact on pruning? >> >> This is an important question. Because stats aren't crashsafe, we >> could return very inaccurate numbers for some time/LSN values if we >> crash. I don't actually know what we could do about that. When I use >> the LSNTimeline for the freeze heuristic it is less of an issue >> because the freeze heuristic has a fallback strategy when there aren't >> enough stats to do its calculations. But other users of this >> LSNTimeline will simply get highly inaccurate results (I think?). Is >> there anything we could do about this? It seems bad. > Do we have something to calculate a sufficiently good "average" to use as a default, if we don't have a better value? For example, we know the timestamp of the last checkpoint, and we know the LSN, right? Maybe if we're sufficiently far from the checkpoint, we could use that. Or maybe checkpoint_timeout / max_wal_size would be enough to calculate some default value? 
I wonder how long it takes until LSNTimeline gives us sufficiently good data for all LSNs we need. That is, if we lose this, how long it takes until we get enough data to do good decisions? Why don't we simply WAL-log this in some trivial way? It's pretty small, so if we WAL-log this once in a while (after a merge happens), that should not be a problem. Or a different idea - if we lost the data, but commit_ts is enabled, can't we "simply" walk commit_ts and feed LSN/timestamp into the timeline? I guess we don't want to walk 2B transactions, but even just sampling some recent transactions might be enough, no? > A complication with this over stats is that we can't recompute this in case of > a crash/corruption issue. The simple solution is to consider this unlogged > data and start fresh at every unclean shutdown, but I have a feeling that won't > be good enough for basing heuristics on? > >> Andres had brought up something at some point about, what if the >> database is simply turned off for awhile and then turned back on. Even >> if you cleanly shut down, will there be "gaps" in the timeline? I >> think that could be okay, but it is something to think about. > > The gaps would represent reality, so there is nothing wrong per se with gaps, > but if they inflate the interval of a bucket which in turns impact the > precision of the results then that can be a problem. > Well, I think the gaps are a problem in the sense that they disappear once we start merging the buckets. But maybe that's fine, if we're only interested in approximate data. >> Just a note, all of my comments could use a lot of work, but I want to >> get consensus on the algorithm before I make sure and write about it >> in a perfect way. > > I'm not sure "a lot of work" is accurate, they seem pretty much there to me, > but I think that an illustration of running through the algorithm in an > ascii-art array would be helpful. 
> +1 > > Reading through this I think such a function has merits, not only for your > usecase but other heuristic based work and quite possibly systems debugging. > > While the bucketing algorithm is a clever algorithm for degrading precision for > older entries without discarding them, I do fear that we'll risk ending up with > answers like "somewhere between in the past and even further in the past". > I've been playing around with various compression algorithms for packing the > buckets such that we can retain precision for longer. Since you were aiming to > work on other patches leading up to the freeze, let's pick this up again once > things calm down. > I guess this ambiguity is pretty inherent to a structure that does not keep all the data, and gradually reduces the resolution for old stuff. But my understanding was that's sufficient for the freezing patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
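As context for the precision concerns discussed above: the translation both sides assume is plain linear interpolation between the two neighboring stream entries. A minimal Python sketch (names and shapes are hypothetical illustrations, not the patch's actual C code):

```python
import bisect

def estimate_lsn_at_time(stream, t):
    """Linearly interpolate an LSN for time t from a stream of
    (time, lsn) pairs sorted by time. Values outside the recorded
    range are clamped to the nearest endpoint."""
    times = [time for time, _ in stream]
    i = bisect.bisect_left(times, t)
    if i == 0:
        return stream[0][1]    # older than the oldest entry: clamp
    if i == len(stream):
        return stream[-1][1]   # newer than the newest entry: clamp
    (t0, l0), (t1, l1) = stream[i - 1], stream[i]
    frac = (t - t0) / (t1 - t0)
    return round(l0 + frac * (l1 - l0))
```

The clamping branches are where the "very off at a young lsn-age" worry bites: with zero or one entries there is nothing to interpolate between.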
Hi everyone! Bharath, Ilya, and I are at a patch review session at PGConf.dev :) Maybe we got everything wrong, please consider that we are just doing training on reviewing patches.

=== Purpose of the patch ===
Currently, we have checkpoint_timeout and max_wal_size to know when we need a checkpoint. This patch brings the capability to freeze pages based not only on the internal state of the system, but also on wall clock time. To do so we need infrastructure that tells us when a page was last modified. The patch in this thread does exactly this: it keeps in-memory information mapping LSNs to wall clock time. The mapping is maintained by the background writer.

=== Questions ===
1. The patch does not handle server restart. Will all pages need freezing after any crash?
2. Some benchmarks to prove the patch does not have a CPU footprint.

=== Nits ===
The term "Timeline" is already taken. The patch needs a rebase due to some header changes. Tests fail on Windows. The patch lacks tests. Some docs would be nice, but the feature is for developers. The mapping is protected for multithreaded access by the walstats LWLock and might call tuplestore_putvalues() under that lock. That might be a little dangerous if the tuplestore ends up on disk for some reason (should not happen).

Overall, the patch is a base for a good feature which would help do freezing right on time. Thanks!

Best regards, Bharath, Andrey, Ilya.
On Mon, Mar 18, 2024 at 1:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 2/22/24 03:45, Melanie Plageman wrote: > > The attached v3 has a new algorithm. Now, LSNTimes are added from the > > end of the array forward until all array elements have at least one > > logical member (array length == volume). Once array length == volume, > > new LSNTimes will result in merging logical members in existing > > elements. We want to merge older members because those can be less > > precise. So, the number of logical members per array element will > > always monotonically increase starting from the beginning of the array > > (which contains the most recent data) and going to the end. We want to > > use all the available space in the array. That means that each LSNTime > > insertion will always result in a single merge. We want the timeline > > to be inclusive of the oldest data, so merging means taking the > > smaller value of two LSNTime values. I had to pick a rule for choosing > > which elements to merge. So, I choose the merge target as the oldest > > element whose logical membership is < 2x its predecessor. I merge the > > merge target's predecessor into the merge target. Then I move all of > > the intervening elements down 1. Then I insert the new LSNTime at > > index 0. > > > > I can't help but think about t-digest [1], which also merges data into > variable-sized buckets (called centroids, which is a pair of values, > just like LSNTime). But the merging is driven by something called "scale > function" which I found like a pretty nice approach to this, and it > yields some guarantees regarding accuracy. I wonder if we could do > something similar here ... > > The t-digest is a way to approximate quantiles, and the default scale > function is optimized for best accuracy on the extremes (close to 0.0 > and 1.0), but it's possible to use scale functions that optimize only > for accuracy close to 1.0. 
> > This may be misguided, but I see similarity between quantiles and what > LSNTimeline does - timestamps are ordered, and quantiles close to 0.0 > are "old timestamps" while quantiles close to 1.0 are "now". > > And t-digest also defines a pretty efficient algorithm to merge data in > a way that gradually combines older "buckets" into larger ones. I started taking a look at this paper and think the t-digest could be applicable as a possible alternative data structure to the one I am using to approximate page age for the actual opportunistic freeze heuristic -- especially since the centroids are pairs of a mean and a count. I couldn't quite understand how the t-digest is combining those centroids. Since I am not computing quantiles over the LSNTimeStream, though, I think I can probably do something simpler for this part of the project. > >> - The LSNTimeline comment claims an array of size 64 is large enough to > >> not need to care about filling it, but maybe it should briefly explain > >> why we can never fill it (I guess 2^64 is just too many). -- snip -- > I guess that should be enough for (2^64-1) logical members, because it's > a sequence 1, 2, 4, 8, ..., 2^63. Seems enough. > > But now that I think about it, does it make sense to do the merging > based on the number of logical members? Shouldn't this really be driven > by the "length" of the time interval the member covers? After reading this, I decided to experiment with a few different algorithms in python and plot the unabbreviated LSNTimeStream against various ways of compressing it. You can see the results if you run the python code here [1]. What I found is that attempting to calculate the error represented by dropping a point and picking the point which would cause the least additional error were it to be dropped produced more accurate results than combining the oldest entries based on logical membership to fit some series. 
This is inspired by what you said about using the length of segments to decide which points to merge. In my case, I want to merge segments that have a similar slope -- those which have a point that is essentially redundant. I loop through the LSNTimeStream and look for the point that we can drop with the lowest impact on future interpolation accuracy. To do this, I calculate the area of the triangle made by each point on the stream and its adjacent points. The idea is that if you drop that point, the triangle's area is the amount of error you introduce for points interpolated between the new pair (the dropped point's former neighbors). This has some issues, but it seems more logical than just always combining the oldest points. If you run the python simulator code, you'll see that for the LSNTimeStream I generated, using this method produces more accurate results than either randomly dropping points or using the "combine oldest members" method. It would be nice if we could do something with the accumulated error -- so we could use it to modify estimates when interpolating. I don't really know how to keep it though. I thought I would just save the calculated error in one or the other of the adjacent points after dropping a point, but then what do we do with the error saved in a point before it is dropped? Add it to the error value in one of the adjacent points? If we did, what would that even mean? How would we use it? - Melanie [1] https://gist.github.com/melanieplageman/95126993bcb43d4b4042099e9d0ccc11
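The dropping rule described above (essentially Visvalingam-Whyatt line simplification applied to the (lsn, time) stream) can be sketched in Python; the names are hypothetical and this only mirrors the spirit of the linked simulator:

```python
def triangle_area(p_prev, p, p_next):
    # Shoelace formula: area of the triangle a point forms with its
    # stream neighbors. Zero area means the point is collinear with
    # its neighbors and therefore redundant for interpolation.
    (x1, y1), (x2, y2), (x3, y3) = p_prev, p, p_next
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

def drop_cheapest_point(stream):
    """Remove the interior (lsn, time) point whose removal introduces
    the least future interpolation error (the smallest triangle)."""
    areas = [triangle_area(stream[i - 1], stream[i], stream[i + 1])
             for i in range(1, len(stream) - 1)]
    victim = areas.index(min(areas)) + 1
    return stream[:victim] + stream[victim + 1:]
```

Calling drop_cheapest_point once per insertion, after the array is full, keeps the stream at a fixed size while preferentially discarding points that lie on nearly straight segments.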
Thanks for the review! Attached v4 implements the new algorithm/compression described in [1]. We had discussed off-list possibly using error in some way. So, I'm interested to know what you think about this method I suggested which calculates error. It doesn't save the error so that we could use it when interpolating for reasons I describe in that mail. If you have any ideas on how to use the calculated error or just how to combine error when dropping a point, that would be super helpful. Note that in this version, I've changed the name from LSNTimeline to LSNTimeStream to address some feedback from another reviewer about Timeline being already in use in Postgres as a concept. On Mon, Mar 18, 2024 at 10:02 AM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 22 Feb 2024, at 03:45, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> - I wonder what happens if we lose the data - we know that if people > >> reset statistics for whatever reason (or just lose them because of a > >> crash, or because they're on a replica), bad things happen to > >> autovacuum. What's the (expected) impact on pruning? > > > > This is an important question. Because stats aren't crashsafe, we > > could return very inaccurate numbers for some time/LSN values if we > > crash. I don't actually know what we could do about that. When I use > > the LSNTimeline for the freeze heuristic it is less of an issue > > because the freeze heuristic has a fallback strategy when there aren't > > enough stats to do its calculations. But other users of this > > LSNTimeline will simply get highly inaccurate results (I think?). Is > > there anything we could do about this? It seems bad. > > A complication with this over stats is that we can't recompute this in case of > a crash/corruption issue. 
The simple solution is to consider this unlogged > data and start fresh at every unclean shutdown, but I have a feeling that won't > be good enough for basing heuristics on? Yes, I still haven't dealt with this yet. Tomas had a list of suggestions in an upthread email, so I will spend some time thinking about those next. It seems like we might be able to come up with some way of calculating a valid "default" value or "best guess" which could be used whenever there isn't enough data. Though, if we crash and lose some time stream data, we won't know that that data was lost due to a crash so we wouldn't know to use our "best guess" to make up for it. So, maybe I should try and rebuild the stream using some combination of WAL, clog, and commit timestamps? Or perhaps I should do some basic WAL logging just for this data structure. > > Andres had brought up something at some point about, what if the > > database is simply turned off for awhile and then turned back on. Even > > if you cleanly shut down, will there be "gaps" in the timeline? I > > think that could be okay, but it is something to think about. > > The gaps would represent reality, so there is nothing wrong per se with gaps, > but if they inflate the interval of a bucket which in turns impact the > precision of the results then that can be a problem. Yes, actually I added some hacky code to the quick and dirty python simulator I wrote [2] to test out having a big gap with no updates (if there is no db activity so nothing for bgwriter to do or the db is off for a while). And it seemed to basically work fine. > While the bucketing algorithm is a clever algorithm for degrading precision for > older entries without discarding them, I do fear that we'll risk ending up with > answers like "somewhere between in the past and even further in the past". > I've been playing around with various compression algorithms for packing the > buckets such that we can retain precision for longer. 
Since you were aiming to > work on other patches leading up to the freeze, let's pick this up again once > things calm down. Let me know what you think about the new algorithm. I wonder if, for points older than the second-to-oldest point, we should have the function return something like "older than a year ago" instead of guessing. The new method doesn't focus on compressing old data though. > When compiling I got this warning for lsntime_merge_target: > > pgstat_wal.c:242:1: warning: non-void function does not return a value in all control paths [-Wreturn-type] > } > ^ > 1 warning generated. > > The issue seems to be that the can't-really-happen path is protected with an > Assertion, which is a no-op for production builds. I think we should handle > the error rather than rely on testing catching it (since if it does happen even > though it can't, it's going to be when it's for sure not tested). Returning an > invalid array subscript like -1 and testing for it in lsntime_insert, with an > elog(WARNING..), seems enough. > > > - last_snapshot_lsn <= GetLastImportantRecPtr()) > + last_snapshot_lsn <= current_lsn) > I think we should delay extracting the LSN with GetLastImportantRecPtr until we > know that we need it, to avoid taking locks in this codepath unless needed. > > I've attached a diff with the above suggestions which applies on top of your > patchset. I've implemented these review points in the attached v4. - Melanie [1] https://www.postgresql.org/message-id/CAAKRu_YbbZGz-X_pm2zXJA%2B6A22YYpaWhOjmytqFL1yF_FCv6w%40mail.gmail.com [2] https://gist.github.com/melanieplageman/7400e81bbbd518fe08b4af55a9b632d4
Thanks so much Bharath, Andrey, and Ilya for the review! I've posted a new version here [1] which addresses some of your concerns. I'll comment on those it does not address inline. On Thu, May 30, 2024 at 1:26 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > === Questions === > 1. The patch does not handle server restart. All pages will need freeze after any crash? I haven't fixed this yet. See my email for some thoughts on what I should do here. > 2. Some benchmarks to proof the patch does not have CPU footprint. This is still a todo. Typically when designing a benchmark like this, I would want to pick a worst-case workload to see how bad it could be. I wonder if just a write-heavy workload like pgbench builtin tpcb-like would be sufficient? > === Nits === > "Timeline" term is already taken. I changed it to LSNTimeStream. What do you think? > The patch needs rebase due to some header changes. I did this. > Tests fail on Windows. I think this was because of the compiler warnings, but I need to double-check now. > The patch lacks tests. I thought about this a bit. I wonder what kind of tests make sense. I could 1) Add tests with the existing stats tests (src/test/regress/sql/stats.sql) and just test that bgwriter is in fact adding to the time stream. 2) Or should I add some infrastructure to be able to create an LSNTimeStream and then insert values to it and do some validations of what is added? I did a version of this but it is just much more annoying with C & SQL than with python (where I tried out my algorithms) [2]. > Some docs would be nice, but the feature is for developers. I added some docs. > Mapping is protected for multithreaded access by walstats LWlock and might have tuplestore_putvalues() under that lock. That might be a little dangerous, if tuplestore will be on-disk for some reason (should not happen). Ah, great point! I forgot about the *fetch_stat*() functions.
I used pgstat_fetch_stat_wal() in the latest version so I have a local copy that I can stuff into the tuplestore without any locking. It won't be as up-to-date, but I think that is 100% okay for this function. - Melanie [1] https://www.postgresql.org/message-id/CAAKRu_a6WSkWPtJCw%3DW%2BP%2Bg-Fw9kfA_t8sMx99dWpMiGHCqJNA%40mail.gmail.com [2] https://gist.github.com/melanieplageman/95126993bcb43d4b4042099e9d0ccc11
On Wed, Jun 26, 2024 at 10:04 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > I've implemented these review points in the attached v4. I realized the docs had a compilation error. Attached v5 fixes that as well as three bugs I found while using this patch set more intensely today. I see Michael has been working on some crash safety for stats here [1]. I wonder if that would be sufficient for the LSNTimeStream. I haven't examined his patch functionality yet, though. I also had an off-list conversation with Robert where he suggested I could perhaps change the user-facing functions for estimating an LSN/time conversion to return a floor and a ceiling instead of linearly interpolating a guess. This would be a way to keep users from misunderstanding the accuracy of the functions to translate LSN <-> time. I'm interested in what others think of this. I like this idea a lot because it allows me to worry less about how I decide to compress the data and whether or not it will be accurate for use cases different from my own (the opportunistic freezing heuristic). If I can provide a floor and a ceiling that are definitely accurate, I don't have to worry about misleading people. - Melanie [1] https://www.postgresql.org/message-id/ZnEiqAITL-VgZDoY%40paquier.xyz
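The floor/ceiling idea Robert suggested can be sketched as a lookup that returns the two recorded stream entries bracketing the target, rather than interpolating between them. A Python sketch with hypothetical names:

```python
import bisect

def lsn_bounds_for_time(stream, t):
    """Return (floor_lsn, ceiling_lsn) bounding time t on a stream of
    (time, lsn) pairs sorted by time. None marks a side where the
    stream has no recorded entry, i.e. the bound is unknown."""
    times = [time for time, _ in stream]
    i = bisect.bisect_right(times, t)
    floor_lsn = stream[i - 1][1] if i > 0 else None
    ceiling_lsn = stream[i][1] if i < len(stream) else None
    return floor_lsn, ceiling_lsn
```

Returning None on an unbounded side makes the "we simply don't know" case explicit instead of a guess, which is the point of the floor/ceiling framing: the bounds stay correct no matter how aggressively the stream is compressed.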