Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
From: Peter Geoghegan
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Msg-id: CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com
In reply to: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations (Peter Geoghegan <pg@bowt.ie>)
Responses:
  Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations (John Naylor <john.naylor@enterprisedb.com>)
  Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations (Robert Haas <robertmhaas@gmail.com>)
  Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I do see some value in that, too. Though it's not going to be a way of
> turning off the early freezing stuff, which seems unnecessary (though
> I do still have work to do on getting the overhead for that down).

Attached is v7, a revision that overhauls the algorithm that decides what to freeze. I'm now calling it block-driven freezing in the commit message. Also included is a new patch that makes VACUUM record zero free space in the FSM for an all-visible page, unless the total amount of free space happens to be greater than one half of BLCKSZ.

The fact that I am now including this new FSM patch (v7-0006-*patch) may seem like a case of expanding the scope of something that could well do without it. But hear me out! It's true that the new FSM patch isn't essential. I'm including it now because it seems relevant to the approach taken with block-driven freezing -- it may even make my general approach easier to understand.

The new approach to freezing is to freeze every tuple on a block that is about to be set all-visible (and thus set it all-frozen too), or to not freeze anything on the page at all (at least until some XID gets really old, which should be rare). This approach has all the benefits that I described upthread, plus a new one: it effectively encourages the application to allow pages to "become settled".

The main difference in how we freeze here (relative to v6 of the patch) is that I'm *not* freezing a page just because it was dirtied/pruned. I now think about freezing as an essentially page-level thing, barring edge cases where we have to freeze individual tuples simply because their XIDs really are getting old (it's an edge case when we can't freeze all the tuples on a page together due to a mix of new and old, which is something we specifically set out to avoid now).

Freezing whole pages
====================

When VACUUM sees that all remaining/unpruned tuples on a page are all-visible, that isn't just important for cost control reasons. It's deeper than that. It's also treated as a tentative signal from the application itself, about the data itself. Namely: this page looks "settled" -- it may never be updated again, and if it is updated, the update likely won't change much about the whole page. Also, if the page is ever updated in the future, that will likely happen at a much later time than you should expect for those *other* nearby pages that *don't* appear to be settled. And so VACUUM infers that the page is *qualitatively* different from those other nearby pages.

VACUUM therefore makes it hard (though not impossible) for future inserts or updates to disturb these settled pages, via this FSM behavior -- it is short-sighted to see the space remaining on such a page as free space just like any other. This holistic approach seems to work well for TPC-C/BenchmarkSQL, and perhaps even in general. More on TPC-C below.

This is not unlike the approach taken by other DB systems, where free space management is baked into concurrency control, and the concept of physical data independence as we know it from Postgres never really existed.
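To make the page-level decision concrete, here's a rough C sketch of the block-driven freezing rule I'm describing. This is illustrative only -- it is not code from the v7 patches, and the function name and parameters are made up; real code would also use TransactionIdPrecedes() rather than a raw comparison, to cope with XID wraparound.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Illustrative sketch only (not from the v7 patches): either every
 * remaining tuple on the page is frozen -- when the page is about to be
 * set all-visible, so that it can be set all-frozen at the same time --
 * or nothing on the page is frozen, unless an individual XID has grown
 * so old that freezing it can no longer be put off.
 */
static bool
should_freeze_tuple(bool page_becomes_all_visible,
                    TransactionId tuple_xmin,
                    TransactionId backstop_cutoff)
{
    /* Block-driven case: freeze the whole page together */
    if (page_becomes_all_visible)
        return true;

    /*
     * Backstop case: this XID really is getting old.  (Real code would
     * use TransactionIdPrecedes() here, to handle XID wraparound.)
     */
    if (tuple_xmin < backstop_cutoff)
        return true;

    /* Otherwise leave the page's tuples unfrozen, for now */
    return false;
}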
My approach also seems related to the concept of a "tenured generation", which is key to generational garbage collection. The whole basis of generational garbage collection is the generational hypothesis: "most objects die young". This is an empirical observation about how applications written in GC'd programming languages actually behave, not a rigorous principle, and yet in practice it appears to always hold. Intuitively, it seems to me like the hypothesis must work in practice because if it didn't, then a counterexample nemesis application would behave totally chaotically, in every way. Theoretically possible, but of no real concern, since such a program makes zero practical sense *as an actual program*. A Java program must make sense to *somebody* (at least the person that wrote it), which, it turns out, helpfully constrains the space of possibilities that any industrial-strength GC implementation needs to handle well.

The same principles seem to apply here, with VACUUM. Grouping logical rows into pages that become their "permanent home until further notice" may be somewhat arbitrary, at first, but that doesn't mean it won't end up sticking. It's just like generational garbage collection: the application isn't expected to instruct the GC about its plans for the memory it allocates, and yet that memory can nevertheless be usefully organized into distinct generations through an adaptive process.

Second order effects
====================

Relating the FSM to page freezing/all-visible setting makes much more sense if you consider the second order effects. There is bound to be competition for free space among backends that access the free space map. By *not* freezing a page during VACUUM because it looks unsettled, we make its free space available in the traditional way instead. It follows that unsettled pages (in tables with lots of updates) are now the only place that backends needing more free space from the FSM can look -- unsettled pages therefore become a hot commodity, freespace-wise. A page that initially appeared "unsettled" but went on to become settled in this newly competitive environment might have had that happen by pure chance -- but probably not. It *could* happen by chance, of course -- in which case the page will get dirtied again, and the cycle continues, for now. There will be further opportunities to figure it out, and freezing the tuples on the page "prematurely" still has plenty of benefits.

Locality matters a lot, obviously. The goal with the FSM stuff is merely to make it *possible* for pages to settle naturally, to the extent that we can. We really just want to avoid hindering a naturally occurring process -- we want to avoid destroying naturally occurring locality. We must be willing to accept some cost for that. Even if it takes a few attempts for certain pages, constraining the application's choice of where to get free space from (it can't be a page marked all-visible) allows pages to *systematically* become settled over time.

The application is in charge, really -- not VACUUM. This is already the case, whether we like it or not. VACUUM needs to learn to live in that reality, rather than fighting it. When VACUUM considers a page settled, and the physical page still has a relatively large amount of free space (say 45% of BLCKSZ, a borderline case under the new FSM patch), "losing" so much free space certainly is unappealing. We set the free space to 0 in the free space map all the same, because we're cutting our losses at that point. While the exact threshold I've proposed is tentative, the underlying theory seems pretty sound to me.
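To show the cutoff I have in mind more concretely, here is a rough sketch of how VACUUM could decide what to report to the free space map for a page it has just processed. Again, this is illustrative only -- not the v7-0006 patch's actual code, and the function name is made up.

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Illustrative sketch only (not the v7-0006 patch's actual code): how
 * much free space should VACUUM advertise in the FSM for a page?
 */
static size_t
freespace_to_report(size_t freespace, bool set_all_visible)
{
    /*
     * A page being set all-visible is treated as settled: report zero
     * free space, so that future inserts and updates are steered toward
     * unsettled pages -- unless more than half of the block is free, in
     * which case writing the space off would be too wasteful.
     */
    if (set_all_visible && freespace <= BLCKSZ / 2)
        return 0;

    return freespace;
}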
The BLCKSZ/2 cutoff (and the way that it extends the general rules for whole-page freezing) is intended to catch pages that are qualitatively different, as well as quantitatively different. It is a balancing act between not wasting space and the risk of systemic problems involving excessive amounts of non-HOT updates that must move a successor version to another page. It's possible that a higher cutoff (for example a cutoff of 80% of BLCKSZ, not 50%) would actually lead to *worse* space utilization, in addition to the downsides from fragmentation -- it's far from a simple trade-off. (Not that you should believe that 50% is special; it's just a starting point for me.)

TPC-C
=====

I'm going to talk about a benchmark that ran throughout the week, starting on Monday. Each run lasted 24 hours, and there were 2 runs in total for both the patch and master/baseline. So this benchmark lasted 4 days, not including the initial bulk loading, with databases that were over 450GB in size by the time I was done (that's 450GB+ for both the patch and master). Benchmarking for days at a time is pretty inconvenient, but it seems necessary to see certain effects in play. We need to wait until the baseline/master case starts to have anti-wraparound VACUUMs with default, realistic settings, which just takes days and days.

I am making all of my data for the benchmark in question available, which is way more information than anybody is likely to want -- I dump anything that even might be useful from the system views in an automated way. There are html reports for all 4 of the 24-hour-long runs. Google drive link:

https://drive.google.com/drive/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR?usp=sharing

While the patch did well overall (I will get to the particulars towards the end of the email), I want to start with what I consider to be the important part: the user/admin experience with VACUUM, and VACUUM's performance stability. This is about making VACUUM less scary.

As I've said several times now, with an append-only table like pgbench_history we see a consistent pattern where relfrozenxid is set to a value very close to the same VACUUM's OldestXmin value (even precisely equal to OldestXmin) during each VACUUM operation, again and again, forever -- that case is easy to understand and appreciate, and has already been discussed. Now (with v7's new approach to freezing), a related pattern can be seen in the case of the two big, troublesome TPC-C tables: the orders and order lines tables.

To recap, these tables are somewhat like the history table, in that new orders insert into both tables, again and again, forever. But they also have one huge difference from simple append-only tables, which is the source of most of our problems with TPC-C. The difference is: there are also delayed, correlated updates of each row in each table. Exactly one such update per row for both tables, which takes place hours after each order's insert, when the earlier order is processed by TPC-C's delivery transaction. In the long run we need the data to age out and not get re-dirtied, as the table grows and grows indefinitely, much like with a simple append-only table. At the same time, we don't want poor free space management for these deferred updates. It's adversarial, sort of, but in a way that is grounded in reality.

With the orders and order lines tables, relfrozenxid tends to be advanced up to the OldestXmin used by the *previous* VACUUM operation -- an unmistakable pattern.
I'll show you all of the autovacuum log output for the orders table during the second 24 hour long benchmark run:

2022-01-27 01:46:27 PST LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
 pages: 0 removed, 1205349 remain, 887225 skipped using visibility map (73.61% of total)
 tuples: 253872 removed, 134182902 remain (26482225 newly frozen), 27193 are dead but not yet removable
 removable cutoff: 243783407, older by 728844 xids when operation ended
 new relfrozenxid: 215400514, which is 26840669 xids ahead of previous value
 ...

2022-01-27 05:54:39 PST LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
 pages: 0 removed, 1345302 remain, 993924 skipped using visibility map (73.88% of total)
 tuples: 261656 removed, 150022816 remain (29757570 newly frozen), 29216 are dead but not yet removable
 removable cutoff: 276319403, older by 826850 xids when operation ended
 new relfrozenxid: 243838706, which is 28438192 xids ahead of previous value
 ...

2022-01-27 10:37:24 PST LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
 pages: 0 removed, 1504707 remain, 1110002 skipped using visibility map (73.77% of total)
 tuples: 316086 removed, 167990124 remain (33754949 newly frozen), 33326 are dead but not yet removable
 removable cutoff: 313328445, older by 987732 xids when operation ended
 new relfrozenxid: 276309397, which is 32470691 xids ahead of previous value
 ...

2022-01-27 15:49:51 PST LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
 pages: 0 removed, 1680649 remain, 1250525 skipped using visibility map (74.41% of total)
 tuples: 343946 removed, 187739072 remain (37346315 newly frozen), 38037 are dead but not yet removable
 removable cutoff: 354149019, older by 1222160 xids when operation ended
 new relfrozenxid: 313332249, which is 37022852 xids ahead of previous value
 ...

2022-01-27 21:55:34 PST LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
 pages: 0 removed, 1886336 remain, 1403800 skipped using visibility map (74.42% of total)
 tuples: 389748 removed, 210899148 remain (43453900 newly frozen), 45802 are dead but not yet removable
 removable cutoff: 401955979, older by 1458514 xids when operation ended
 new relfrozenxid: 354134615, which is 40802366 xids ahead of previous value

This mostly speaks for itself, I think. Each VACUUM's new relfrozenxid lands right around the previous VACUUM's removable cutoff -- for example, the second VACUUM sets relfrozenxid to 243838706, just past the first VACUUM's removable cutoff of 243783407. (Anybody that's interested can drill down into the logs for order lines, which look similar.)

The effect we see with the orders/order lines tables isn't perfectly reliable. Actually, it depends on how you define it. It's possible that we'll fail to acquire a cleanup lock on the wrong page at the wrong time, and as a result fail to advance relfrozenxid by the usual amount, once. But that effect appears to be both rare and of no real consequence. One could reasonably argue that we never fell behind, because we still did 99.9%+ of the required freezing -- we just didn't immediately get to advance relfrozenxid, because of a temporary hiccup on one page.

We will still advance relfrozenxid by a small amount in that scenario -- sometimes by only hundreds of XIDs when millions or tens of millions of XIDs were expected. Once we advance it by some amount, we can reasonably suppose that the issue was just a hiccup.
On the master branch, the first 24 hour period has no anti-wraparound VACUUMs, and so looking at that first 24 hour period gives you some idea of how much worse off we are in the short term -- the freezing stuff won't really start to pay for itself until the second 24 hour run with these mostly-default freeze-related settings. The second 24 hour run on master almost exclusively has anti-wraparound VACUUMs for all the largest tables, though -- all at the same time. And not just the first time, either! This causes big spikes that the patch totally avoids, simply by avoiding anti-wraparound VACUUMs. With the patch, there are no anti-wraparound VACUUMs, barring tables that will never be vacuumed for any other reason, where it's still inevitable -- limited here to the stock table and customers table.

It was a mistake for me to emphasize "no anti-wraparound VACUUMs outside pathological cases" before now. I stand by those statements as accurate, but anti-wraparound VACUUMs should not have been given so much emphasis. Let's assume that somehow we really were to get an anti-wraparound VACUUM against one of the tables where that's just not expected, like this orders table -- let's suppose that I got that part wrong, in some way. It would hardly matter at all! We'd still have avoided the freezing cliff during this anti-wraparound VACUUM, which is the real benefit. Chances are good that we needed to VACUUM anyway, just to clean up any very old garbage tuples -- relfrozenxid is now predictive of the age of the oldest garbage tuples, which might have been a good enough reason to VACUUM anyway. The stampede of anti-wraparound VACUUMs against multiple tables seems like it would still be fixed, since relfrozenxid now actually tells us something about the table (as opposed to telling us only about what the user set vacuum_freeze_min_age to). The only concerns that this leaves for me are all usability related, and not of primary importance (e.g. do we really need to make anti-wraparound VACUUMs non-cancelable now?).

TPC-C raw numbers
=================

The single most important number for the patch might be the decrease in both buffer misses and buffer hits, which I believe is caused by the patch being able to use index-only scans much more effectively (with modifications to BenchmarkSQL to improve the indexing strategy [1]). This is quite clear from pg_stat_database state at the end.

Patch:

 xact_commit   | 440,515,133
 xact_rollback | 1,871,142
 blks_read     | 3,754,614,188
 blks_hit      | 174,551,067,731
 tup_returned  | 341,222,714,073
 tup_fetched   | 124,797,772,450
 tup_inserted  | 2,900,197,655
 tup_updated   | 4,549,948,092
 tup_deleted   | 165,222,130

Here is the same pg_stat_database info for master:

 xact_commit   | 440,402,505
 xact_rollback | 1,871,536
 blks_read     | 4,002,682,052
 blks_hit      | 283,015,966,386
 tup_returned  | 346,448,070,798
 tup_fetched   | 237,052,965,901
 tup_inserted  | 2,899,735,420
 tup_updated   | 4,547,220,642
 tup_deleted   | 165,103,426

blks_read for the patch is x0.938 of master/baseline -- not bad. More importantly, blks_hit is x0.616 for the patch -- quite a significant reduction in a key cost. Note that we start to get this particular benefit for individual read queries pretty early on -- avoiding unsetting visibility map bits like this matters right from the start. In TPC-C terms, the ORDER_STATUS transaction will have much lower latency, particularly tail latency, since it uses index-only scans to good effect.
There are 5 distinct transaction types in the benchmark, and an improvement that shows up in only one particular transaction type isn't unusual -- so you often have to drill down and look at the full html report. The latency situation is improved across the board with the patch, by quite a bit, especially after the second run. This server can sustain much more throughput than the TPC-C spec formally permits -- even though I've increased the benchmark's TPM rate well beyond the spec-legal limit (by 10x) -- so query latency is the main TPC-C metric of interest here.

WAL
===

Then there's the WAL overhead. As with practically any workload, WAL consumption for this workload is dominated by FPIs, despite the fact that I've tuned checkpoints reasonably well. The patch *does* write more WAL in the first set of runs -- a total of ~3.991 TiB, versus ~3.834 TiB for master. In other words, during the first 24 hour run (before the trouble with the anti-wraparound freeze cliff even begins for the master branch), the patch writes x1.040 as much WAL in total. The good news is that the patch comes out ahead by the end, after the second set of 24 hour runs. By the time the second run finishes, it's 8.332 TiB of WAL total for the patch, versus 8.409 TiB for master, putting the patch at x0.990 in the end -- a small improvement. I believe that most of the WAL doesn't get generated by VACUUM here anyway -- opportunistic pruning works well for this workload.

I expect to be able to commit the first 2 patches in a couple of weeks, since that won't need to block on making the case for the final 3 or 4 patches from the patch series. The early stuff is mostly just refactoring work that removes needless differences between aggressive and non-aggressive VACUUM operations. It makes a lot of sense on its own.

[1] https://github.com/pgsql-io/benchmarksql/pull/16

--
Peter Geoghegan
Attachments
- v7-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch
- v7-0005-Make-block-level-characteristics-drive-freezing.patch
- v7-0006-Add-all-visible-FSM-heuristic.patch
- v7-0003-Consolidate-VACUUM-xid-cutoff-logic.patch
- v7-0002-Add-VACUUM-instrumentation-for-scanned-pages-relf.patch
- v7-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch