Re: POC: Cleaning up orphaned files using undo logs
От | Dilip Kumar |
---|---|
Тема | Re: POC: Cleaning up orphaned files using undo logs |
Дата | |
Msg-id | CAFiTN-s2fFcegTBbUGABREZtG=cR0rAAX1diTbNXF0dSXttDSQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>) |
Ответы |
Re: POC: Cleaning up orphaned files using undo logs
(Robert Haas <robertmhaas@gmail.com>)
|
Список | pgsql-hackers |
On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres@anarazel.de> wrote: > > On 2018-12-03 18:43:04 +1300, Thomas Munro wrote: > > > Sorry for my silence... I got stuck on a design problem with the lower > > > level undo log management code that I'm now close to having figured > > > out. I'll have a new patch soon. > > Hello all, > > Here's a new WIP version of this patch set. It builds on a fairly > deep stack of patches being developed by several people. As mentioned > before, it's a useful crash-test dummy for a whole stack of technology > we're working on, but it's also aiming to solve a real problem. > > It currently fails in one regression test for a well understood > reason, fix on the way (see end), and there are some other stability > problems being worked on. > > Here's a quick tour of the observable behaviour, having installed the > pg_buffercache and test_undorecord extensions: > > ================== > > postgres=# begin; > BEGIN > postgres=# create table foo (); > CREATE TABLE > > Check if our transaction has generated undo data: > > postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs ; > logno | discard | insert | xid | pid > -------+------------------+------------------+-----+------- > 0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169 > (1 row) > > Here, we see that undo log number 0 has some undo data because discard > < insert. We can find out what it says: > > postgres=# call dump_undo_records(0); > NOTICE: 0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663, > relfile=16386; xid=476, next xact=0 > CALL > > The undo record shown there lives in shared buffers, and we can see > that it's in there with pg_buffercache (the new column smgrid 1 means > undo data; 0 is regular relation data): > > postgres=# select bufferid, smgrid, relfilenode, relblocknumber, > isdirty, usagecount from pg_buffercache where smgrid = 1; > bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount > ----------+--------+-------------+----------------+---------+------------ > 3 | 1 | 0 | 1 | t | 5 > (1 row) > > Even though that's just a dirty page in shared buffers, if we crash > now and recover, it'll be recreated by a new WAL record that was > flushed *before* creating the relation file. We can see that with > pg_waldump: > > rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel > 1663/0/0 blk 1 FPW > rmgr: Storage ... CREATE base/12916/16384 > > The PRECREATE record dirtied block 1 of undo log 0. In this case it > happened to include a FPW of the undo log page too, following the > usual rules. FPWs are rare for undo pages because of the > REGBUF_WILL_INIT optimisation that applies to the zeroed out pages > (which is most undo pages, due to the append-mostly access pattern). > > Finally, we if commit we see the undo data is discarded by a > background worker, and if we roll back explicitly or crash and run > recovery, the file is unlinked. Here's an example of the crash case: > > postgres=# begin; > BEGIN > postgres=# create table foo (); > CREATE TABLE > postgres=# select relfilenode from pg_class where relname = 'foo'; > relfilenode > ------------- > 16395 > (1 row) > > postgres=# select pg_backend_pid(); > pg_backend_pid > ---------------- > 39169 > (1 row) > > $ kill -9 39169 > > ... server restarts, recovers ... > > $ ls pgdata/base/12916/16395 > pgdata/base/12916/16395 > > It's still there, though it's been truncated by an undo worker (see > end of email). And finally, after the next checkpoint: > > $ ls pgdata/base/12916/16395 > ls: pgdata/base/12916/16395: No such file or directory > > That's the end of the quick tour. > > Most of these patches should probably be discussed in other threads, > but I'm posting a snapshot of the full stack here anyway. Here's a > patch-by-patch summary: > > === 0001 "Refactor the fsync mechanism to support future SMGR > implementations." === > > The 0001 patch has its own CF thread > https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath > (based on earlier work by me), but I'm including a copy here for > convenience/cfbot. > > === 0002 "Add SmgrId to smgropen() and BufferTag." === > > This is new, and is based on the discussion from another recent > thread[1] about how we should identify buffers belonging to different > storage managers. In earlier versions of the patch-set I had used a > special reserved DB OID for undo data. Tom Lane didn't like that idea > much, and Anton Shyrabokau (via Shawn Debnath) suggested making > ForkNumber narrower so we can add a new field to BufferTag, and Andres > Freund +1'd my proposal to add the extra value as a parameter to > smgropen(). So, here is a patch that tries those ideas. > > Another way to do this would be to widen RelFileNode instead, to avoid > having to pass around the SMGR ID separately in various places. > Looking at the number of places that have to chance, you can probably > see why we wanted to use a magic DB OID instead, and I'm not entirely > convinced that it wasn't better that way, or that I've found all the > places that need to carry an smgrid alongside a RelFileNode. > > Archeological note: smgropen() was like that ~15 years ago before > commit 87bd9563, but buffer tags didn't include the SMGR ID. > > I decided to call md.c's ID "SMGR_RELATION", describing what it really > holds -- regular relations -- rather than perpetuating the doubly > anachronistic "magnetic disk" name. > > While here, I resurrected the ancient notion of a per-SMGR 'open' > routine, so that a small amount of md.c-specific stuff could be kicked > out of smgr.c and future implementations can do their own thing here > too. > > While doing that work I realised that at least pg_rewind needs to > learn about how different storage managers map blocks to files, so > that's a new TODO item requiring more thought. I wonder what other > places know how to map { RelFileNode, ForkNumber, BlockNumber } to a > path + offset, and I wonder what to think about the fact that some of > them may be non-backend code... > > === 0003 "Add undo log manager." === > > This and the next couple of patches live in CF thread > https://commitfest.postgresql.org/22/1649/ but here's a much newer > snapshot that hasn't been posted there yet. > > Manages a set of undo logs in shared memory, manages undo segment > files, tracks discard, insert, end pointers visible in > pg_stat_undo_logs. With this patch you can allocate and discard space > in undo logs using the UndoRecPtr type to refer to addresses, but > there is no access to the data yet. Improvements since the last > version are not requiring DSM segments, proper FPW support and reduced > WAL traffic. Previously there were extra per-xact and per-checkpoint > records requiring retry-loops in code that inserted undo data. > > === 0004 "Provide access to undo log data via the buffer manager." === > > Provide SMGR_UNDO. While the 0003 patch deals with allocating and > discarding undo address space and makes sure that backing files exist, > this patch lets you read and write buffered data in them. > > === 0005 "Allow WAL record data on first modification after a checkpoint." === > > Provide a way for data to be attached to a WAL-registered block that > is only included if this turns out to be the first WAL record that > touches the block after a checkpoint. This is a bit like FPW images, > except that it's arbitrary extra data and happens even if FPW is off. > This is used to capture a copy of the (tiny) undo log meta-data > (primary the insertion pointer) to fix a consistency problem when > recovering from an online checkpoint. > > === 0006 + 0007 "Provide interfaces to store and fetch undo records." === > > This is a snapshot of work by my colleagues Dilip, Rafia and others > based on earlier prototyping by Robert. While the earlier patches > give you buffered binary undo data, this patch introduces the concept > of high level undo records that can be inserted, and read back given > an UndoRecPtr. This is a version presented on another thread already; > here it's lightly changed due to rebasing by me. > > Undo-aware modules should design a set of undo record types, and > insert exactly the same ones at do and undo time. > > The 0007 patch is fixups from me to bring that code into line with > changes to the lower level patches. Future versions will be squashed > and tidied up; still working on that. Currently, undo branch[1] contain an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which include 0006+0007 from undo branch[1] + new changes on the zheap branch. [1] https://github.com/EnterpriseDB/zheap/tree/undo [2] https://github.com/EnterpriseDB/zheap [3] https://github.com/EnterpriseDB/zheap/tree/undo (b397d96176879ed5b09cf7322b8d6f2edd8043a5) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Amit KapilaДата:
Сообщение: Re: Unhappy about API changes in the no-fsm-for-small-rels patch