Re: POC: Cleaning up orphaned files using undo logs
От | Robert Haas |
---|---|
Тема | Re: POC: Cleaning up orphaned files using undo logs |
Дата | |
Msg-id | CA+TgmoZ37JOrh25NG1+Kz6kLajMdxj+v2gvjfXN+g6cQFgqp5g@mail.gmail.com обсуждение исходный текст |
Ответ на | POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@enterprisedb.com>) |
Список | pgsql-hackers |
While I've been involved in the design discussions for this patch set, I haven't looked at any of the code personally in a very long time. I certainly don't claim to be an independent reviewer, and I encourage others to review this work also. That said, here are some review comments. I decided to start with 0005, as that has the user-facing documentation for this feature. There is a spurious whitespace-only hunk in monitoring.sgml. + <entry>Process ID of the backend currently attached to this undo log + for writing.</entry> or NULL/0/something if none? + each undo log that exists. Undo logs are extents within a contiguous + addressing space that have their own head and tail pointers. This sentence seems to me to have so little detail that it's not going to help anyone, and it also seems somewhat out-of-place here. I think it would be better to link to the longer explanation in the new storage section instead. + Each backend that has written undo data is associated with one or more undo extra space +<para> +Undo logs hold data that is used for rolling back and for implementing +MVCC in access managers that are undo-aware (currently "zheap"). The storage +format of undo logs is optimized for reusing existing files. +</para> I think the mention of zheap should be removed here since the hope is that the undo stuff can be committed independently of and prior to zheap. I think you mean access methods, not access managers. I suggest making that an xref. Maybe add a little more detail, e.g. Undo logs provide a place for access methods to store data that can be used to perform necessary cleanup operations after a transaction abort. The data will be retained after a transaction abort until the access method successfully performs the required cleanup operations. After a transaction commit, undo data will be retained until the transaction is all-visible. This makes it possible for access managers to use undo data to implement MVCC. Since it most cases undo data is discarded very quickly, the undo system has been optimized to minimize writes to disk and to reuse existing files efficiently. +<para> +Undo data exists in a 64 bit address space broken up into numbered undo logs +that represent 1TB extents, for efficient management. The space is further +broken up into 1MB segment files, for physical storage. The name of each file +is the address of of the first byte in the file, with a period inserted after +the part that indicates the undo log number. +</para> I cannot read this section and know what an undo filename is going to look like. Also, the remarks about efficient management seems like it might be unclear to someone not already familiar with how this works. Maybe something like: Undo data exists in a 64-bit address space divided into 2^34 undo logs, each with a theoretical capacity of 1TB. The first time a backend writes undo, it attaches to an existing undo log whose capacity is not yet exhausted and which is not currently being used by any other backend; or if no suitable undo log already exists, it creates a new one. To avoid wasting space, each undo log is further divided into 1MB segment files, so that segments which are no longer needed can be removed (possibly recycling the underlying file by renaming it) and segments which are not yet needed do not need to be physically created on disk. An undo segment file has a name like <example>, where <thing> is the undo log number and <thang> is the segment number. I think it's good to spell out the part about attaching to undo logs here, because when people look at pg_undo, the number of files will be roughly proportional to the number of backends, and we should try to help them understand - at least in general terms - why that happens. +<para> +Just as relations can have one of the three persistence levels permanent, +unlogged or temporary, the undo data that is generated by modifying them must +be stored in an undo log of the same persistence level. This enables the +undo data to be discarded at appropriate times along with the relations that +reference it. +</para> This is not quite general, because we're not necessarily talking about modifications to the files. In fact, in this POC, we're explicitly talking about the cleanup of the files themselves. Also, it's not technically correct to say that the persistence level has to match. You could put everything in permanent undo logs. It would just suck. Moving on to 0003, the developer documentation: +The undo log subsystem provides a way to store data that is needed for +a limited time. Undo data is generated whenever zheap relations are +modified, but it is only useful until (1) the generating transaction +is committed or rolled back and (2) there is no snapshot that might +need it for MVCC purposes. See src/backend/access/zheap/README for +more information on zheap. The undo log subsystem is concerned with Again, I think this should be rewritten to make it independent of zheap. We hope that this facility is not only usable by but will actually be used by other AMs. +their location within a 64 bit address space. Unlike redo data, the +addressing space is internally divided up unto multiple numbered logs. Except it's not totally unlike; cf. the log and seg arguments to XLogFileNameById. The xlog division is largely a historical accident of having to support systems with 32-bit arithmetic and has minimal consequences in practice, and it's a lot less noticeable now than it used to be, but it does still kinda exist. I would try to sharpen this wording a bit to de-emphasize the contrast over whether a log/seg distinction exists and instead just contrast multiple insertion points vs. a single one. +level code (zheap) is largely oblivious to this internal structure and Another zheap reference. +eviction provoked by memory pressure, then no disk IO is generated. I/O? +Keeping the undo data physically separate from redo data and accessing +it though the existing shared buffers mechanism allows it to be +accessed efficiently for MVCC purposes. And also non-MVCC purposes. I mean, it's not very feasible to do post-abort cleanup driven solely off the WAL, because the WAL segments might've been archived or recycled and there's no easy way to access the bits we want. Saying this is for MVCC purposes specifically seems misleading. +shared memory and can be inspected in the pg_stat_undo_logs view. For Replace "in" with "via" or "through" or something? +shared memory and can be inspected in the pg_stat_undo_logs view. For +each undo log, a set of properties called the undo log's meta-data are +tracked: "called the undo log's meta-data" seems a bit awkward. +* the "discard" pointer; data before this point has been discarded +* the "insert" pointer: new data will be written here +* the "end" pointer: a new undo segment file will be needed at this point why ; for the first and : for the others? +The three pointers discard, insert and end move strictly forwards +until the whole undo log has been exhausted. At all times discard <= +insert <= end. When discard == insert, the undo log is empty I think you should either remove "discard, insert and end" from this sentence, relying on people to remember the list they just read, or else punctuate it like this: The three pointers -- discard, insert, and end -- move... +logs are held in a fixed-sized pool in shared memory. The size of +the array is a multiple of max_connections, and limits the total size of +transactions. I think you should elaborate on "limits the total size of transactions." +The meta-data for all undo logs is written to disk at every +checkpoint. It is stored in files under PGDATA/pg_undo/, using the Even unlogged and temporary undo logs? +level of the relation being modified and the current value of the GUC Suggest: the corresponding relation +suitable undo log must be either found or created. The system should +stabilize on one undo log per active writing backend (or more if +different tablespaces are persistence levels are used). Won't edge effects drive the number up considerably? +and they cannot be accessed by other backend including undo workers. Grammar. Also, begs the question "so how does this work if the undo workers are frozen out?" +Responsibility for WAL-logging the contents of the undo log lies with +client code (ie zheap). While undolog.c WAL-logs all meta-data Another zheap reference. +hard coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo Suggest I/O rather than IO. I'll see if I can find time to actually review some of this code at some point. Regarding 0006, I can't help but notice that it is completely devoid of documentation and README updates, which will not do. Regarding 0007, that's an impressively small patch. ...Robert
В списке pgsql-hackers по дате отправления: