Re: Show various offset arrays for heap WAL records
От | Melanie Plageman |
---|---|
Тема | Re: Show various offset arrays for heap WAL records |
Дата | |
Msg-id | CAAKRu_b2oE4GL=q4g9mcByS9yT7wTQvEH9OLpabj28e+WKFi2A@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Show various offset arrays for heap WAL records (Peter Geoghegan <pg@bowt.ie>) |
Ответы |
Re: Show various offset arrays for heap WAL records
(Peter Geoghegan <pg@bowt.ie>)
|
Список | pgsql-hackers |
Attached v3 is cleaned up and includes a pg_walinspect docs update as well as some edited comments in rmgr_utils.c On Mon, Mar 27, 2023 at 6:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I went to add dedup records and noticed that since the actual > > BTDedupInterval struct is what is put in the xlog, I would need access > > to that type from nbtdesc.c, however, including nbtree.h doesn't seem to > > work because it includes files that cannot be included in frontend code. > > I suppose that the BTDedupInterval struct could have just as easily > gone in nbtxlog.h, next to xl_btree_dedup. There might have been a > moment where I thought about doing it that way, but I guess I found it > slightly preferable to use that symbol name (BTDedupInterval) rather > than (say) xl_btree_dedup_interval in places like the nearby > BTDedupStateData struct. > > Actually, I suppose that it's hard to make that alternative work, at > least without > including nbtxlog.h in nbtree.h. Which sounds wrong. > > > I, of course, could make some local struct in nbtdesc.c which has an > > OffsetNumber and a uint16, since the BTDedupInterval is pretty > > straightforward, but that seems a bit annoying. > > I'm probably missing something obvious, but is there a better way to do > > this? > > It was probably just one of those cases where I settled on the > arrangement that looked least odd overall. Not a particularly > principled approach. But the approach that I'm going to take once more > here. ;-) > > All of the available alternatives are annoying in roughly the same > way, though perhaps to varying degrees. All except one: I'm okay with > just not adding coverage for deduplication records, for the time being > -- just seeing the number of intervals alone is relatively informative > with deduplication records, unlike (say) nbtree delete records. I'm > also okay with having coverage for dedup records if you feel it's > worth having. Your call. > > If we're going to have coverage for deduplication records then it > seems to me that we have to have a struct in nbtxlog.h for your code > to work off of. It also seems likely that we'll want to use that same > struct within nbtxlog.c. What's less clear is what that means for the > BTDedupInterval struct. I don't think that we should include nbtxlog.h > in nbtree.h, nor should we do the converse. > > I guess maybe two identical structs would be okay. BTDedupInterval, > and xl_btree_dedup_interval, with the former still used in nbtdedup.c, > and the latter used through a pointer at the point that nbtxlog.c > reads a dedup record. Then maybe at a sizeof() static assert beside > the existing btree_xlog_dedup() assertions that check that the dedup > state interval array matches the array taken from the WAL record. > That's still a bit weird, but I find it preferable to any alternative > that I can think of. I've omitted enhancements for the dedup record type for now. > > On another note, I've thought about how to include some example output > > in docs, and, for example we could modify the example output in the > > pgwalinspect docs which includes a PRUNE record already for > > pg_get_wal_record_info() docs. We'd probably just want to keep it short. > > Yeah. Perhaps a PRUNE record for one of the system catalogs whose > relfilenode is relatively recognizable. Say pg_class. It probably > doesn't matter that much, but there is perhaps some small value in > picking an example that is relatively easy to recreate later on (or to > approximately recreate). I'm certainly not insisting on that, though. I've added such an example to pg_walinspect docs. On Tue, Mar 21, 2023 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > > There are several different things that seem important to me > > personally. These are in tension with each other, to a degree. These > > are: > > > > 1. Like Andres, I'd really like to have some way of inspecting things > > like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant > > detail. These record types happen to be very important in general, and > > the ability to see detailed information about the WAL record would > > definitely help with some debugging scenarios. I've really missed > > stuff like this while debugging serious issues under time pressure. > > One problem that I often run into when performing analysis of VACUUM > using pg_walinspect is the issue of *who* pruned which heap page, for > any given PRUNE record. Was it VACUUM/autovacuum, or was it > opportunistic pruning? There is no way of knowing for sure right now. > You *cannot* rely on an xid of 0 as an indicator of a given PRUNE > record coming from VACUUM; it could just have been an opportunistic > prune operation that happened to take place when a SELECT query ran, > before any XID was ever allocated. > > I think that we should do something like the attached, to completely > avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's > similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all > XLOG_HEAP2 record types to indicate that they took place during > VACUUM, by XOR'ing the flag with the record type/info when > XLogInsert() is called. For now this is only used by PRUNE records. > Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM" > record_type, as well as the unadorned Heap2/PRUNE record_type, which > we'll now know must have been opportunistic pruning. > > The approach of using a bit in the style of the heapam init bit makes > sense to me, because the bit is available, and works in a way that is > minimally invasive. Also, one can imagine needing to resolve a similar > ambiguity in the future, when (say) opportunistic freezing is added. > > I think that it makes sense to treat this within the scope of > Melanie's ongoing work to improve the instrumentation of these records > -- meaning that it's in scope for Postgres 16. Admittedly this is a > slightly creative interpretation, so if others disagree then I won't > argue. This is quite a small patch, though, which makes debugging > significantly easier. I think that there could be a great deal of > utility in being able to easily "pair up" corresponding > "Heap2/PRUNE+BYVACUUM" and "Heap2/VACUUM" records in debugging > scenarios. I can imagine linking these to "Heap2/FREEZE_PAGE" and > "Heap2/VISIBLE" records, too, since they're all closely related record > types. I really like this idea and would find it useful. I reviewed the patch and tried it out and it worked for me and code looked fine as well. I didn't include it in the attached patchset because I don't feel confident enough in my own understanding of any potential implications of splitting up these record types to definitively endorse it. But, if someone else felt comfortable with it, I would like to see it in the tree. - Melanie
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Daniel GustafssonДата:
Сообщение: Re: Making background psql nicer to use in tap tests