Re: Logical archiving
From | Andrey Borodin |
---|---|
Subject | Re: Logical archiving |
Date | |
Msg-id | 2B44FA4B-7500-4B37-82BD-BFACA20001AD@yandex-team.ru |
In reply to | Re: Logical archiving (Craig Ringer <craig.ringer@enterprisedb.com>) |
Responses | Re: Logical archiving (Craig Ringer <craig.ringer@enterprisedb.com>) |
List | pgsql-hackers |
Thanks, Craig! Probably I should rather ask in your nearby thread about logical replication; it just seemed to me that logical archiving is a somewhat small, independent piece of functionality...

> On 7 Dec 2020, at 08:05, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
>
> Reply follows inline. I addressed your last point first, so it's out of order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin <x4mmm@yandex-team.ru> wrote
>
> > If OLAP cannot consume data fast enough - we are out of space due to repl slot.
>
> There is a much simpler solution to this than logical PITR.
>
> What we should be doing is teaching xlogreader how to invoke the restore_command to fetch archived WALs for decoding.
> Replication slots already have a WAL retention limit, but right now when that limit is reached the slot is invalidated and becomes useless, it's effectively dropped. Instead, if WAL archiving is enabled, we should leave the slot as valid. If a consumer of the slot needs WAL that no longer exists in pg_wal, we should have the walsender invoke the restore_command to read the missing WAL segment, decode it, and remove it again.
>
> This would not be a technically difficult patch, and it's IMO one of the more important ones for improving logical replication.

Currently we have restore_command in the regular config, not in recovery.conf, so, probably, it should not be a very big deal to implement this.
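For context, most of the knobs involved already exist; a minimal sketch of the pieces such a patch would build on (the archive commands and size limit below are placeholders, and I'm assuming v13+ for max_slot_wal_keep_size):

-- Keep a WAL archive around and cap per-slot retention in pg_wal.
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'cp %p /wal_archive/%f';
ALTER SYSTEM SET restore_command = 'cp /wal_archive/%f %p';  -- in postgresql.conf since v12
ALTER SYSTEM SET max_slot_wal_keep_size = '64GB';

-- Today a slot that falls behind by more than max_slot_wal_keep_size is
-- invalidated (wal_status = 'lost') instead of falling back to the archive:
SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;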
>
> > I was discussing problems of CDC with scientific community and they asked this simple question: "So you have efficient WAL archive on a very cheap storage, why don't you have a logical archive too?"
>
> I've done work in this area, as has Petr (CC'd).
>
> In short, logical archiving and PITR is very much desirable, but we're not nearly ready for it yet and we're missing a lot of the foundations needed to make it really useful.
>
> IMO the strongest pre-requisite is that we need integrated DDL capture and replication in Pg. While this could be implemented in the publisher/subscriber logic for logical replication, it would make much more sense (IMO) to make it easier to feed DDL events into any logical replication output plugin.
>
> pglogical3 (the closed one) has quite comprehensive DDL replication support. Doing it is not simple though - there are plenty of complexities:
>
> * Reliably identifying the target objects and mapping them to replication set memberships for DML-replication
> * Capturing, replicating and managing the search_path and other DDL execution context (DateStyle and much more) reliably
> * Each statement type needs specific logic to indicate whether it needs DDL replication (and often filter functions since we have lots of sub-types where some need replication and some don't)
> * Handling DDL affecting global objects in pg_global correctly, like those affecting roles, grants, database security labels etc. There's no one right answer for this, it depends on the deployment and requires the user to cooperate.
> * Correct handling of transactions that mix DDL and DML (mostly only an issue for multimaster).
> * Identifying statements that target a mix of replicated and non-replicated objects and handling them appropriately, including for CASCADEs
> * Gracefully handling DDL statements that mix TEMPORARY and persistent targets. We can do this ok for DROPs but it still requires care. Anything else gets messier.
> * Lack of hooks into table rewrite operations and the extremely clumsy and inefficient way logical decoding currently exposes decoding of the temp-table data during decoding of rewrites means handling table-rewriting DDL is difficult and impractical to do correctly. In pglogical we punt on it entirely and refuse to permit DDL that would rewrite a table except where we can prove it's reliant only on immutable inputs so we can discard the upstream rewrite and rely on statement replication.
> * As a consequence of the above, reliably determining whether a given statement will cause a table rewrite.
> * Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
> * Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing something sensible for truncate cascade.
> * Probably more I've forgotten
>
> If we don't handle these, then any logical change-log archives will become largely useless as soon as there's any schema change.
>
> So we kind of have to solve DDL replication first IMO.
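FWIW, the event trigger machinery already gives a starting point for the capture side, even though it addresses none of the hard parts above (rewrites, global objects, execution context, ...). A rough sketch of that kind of approach, with made-up table and function names:

-- Queue DDL commands together with some of their execution context.
CREATE TABLE captured_ddl (
    lsn             pg_lsn DEFAULT pg_current_wal_lsn(),
    command_tag     text,
    object_identity text,
    search_path     text
);

CREATE FUNCTION capture_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO captured_ddl (command_tag, object_identity, search_path)
        VALUES (r.command_tag, r.object_identity, current_setting('search_path'));
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER capture_ddl_end ON ddl_command_end
    EXECUTE FUNCTION capture_ddl();

-- There is also an ON table_rewrite event, which helps with detecting (or
-- forbidding) rewriting DDL, but not with decoding the rewrite itself.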
>
> Some consideration is also required for metadata management. Right now relation and type metadata has session-lifetime, but you'd want to be able to discard old logical change-stream archives and have the later ones still be usable. So we'd need to define some kind of restartpoint where we repeat the metadata, or we'd have to support externalizing the metadata so it can be retained when the main change archives get aged out.
>
> We'd also need to separate the existing apply worker into a "receiver" and "apply/writer" part, so the wire-protocol handling isn't tightly coupled with the actual change apply code, in order to make it possible to actually consume those archives and apply them to the database. In pglogical3 we did that by splitting them into two processes, connected by a shm_mq. Originally the process split was optional and you could run a combined receiver/writer process without the shm_mq if you wanted, but we quickly found it difficult to reliably handle locking issues etc that way so the writers all moved out-of-process.
>
> That was done mainly to make it possible to support parallelism in logical decoding apply. But we also have the intention of supporting an alternative reader process that can ingest "logical archives" and send them to the writer to apply them, as if they'd been received from the on-wire stream. That's not implemented at this time though. It'd be useful for a number of things:
>
> * PITR-style logical replay and recovery
> * Ability to pre-decode a txn once on the upstream then send the buffered protocol-stream to multiple subscribers, saving on logical decoding and reorder buffering overheads and write-multiplication costs
> * ability to ingest change-streams generated by non-postgres sources so we could support streaming foreign-data ingestion, streaming OLAP and data warehousing, etc
>
> To make logical PITR more useful we'd also want to be a bit more tolerant of schema divergence, though that's not overly hard to do:
>
> * fill defaults for downstream columns if no value is present for the column in the upstream row and the downstream column is nullable or has a default (I think built-in logical rep does this one already)
> * ignore values for columns in upstream data if the downstream table lacks the column and the upstream value is null
> * optionally allow apply to be configured to ignore non-null data in upstream columns that're missing on downstream
> * optionally allow apply to be configured to drop rows on the floor if the downstream table is missing
> * policies for handling data conflicts like duplicate PKs
>
> and we'd probably want ways to filter the apply data-stream to apply changes for only a subset of tables, rows, etc at least in a later version.
>
> None of this is insurmountable. Most or all of the DDL replication support and divergence-tolerance stuff is already done in production deployments using pglogical3 and bdr3.

I really like this wording for the "divergence-tolerance" stuff; it captures problems I want to solve. I believe it's somewhat orthogonal to the other issues.
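For reference, the first of those bullets does already work with built-in logical replication; a quick sketch (names and the connection string are placeholders):

-- Publisher side.
CREATE TABLE events (id bigint PRIMARY KEY, payload text);
CREATE PUBLICATION events_pub FOR TABLE events;

-- Subscriber side: one extra defaulted column compared to the publisher;
-- apply fills it from the default.
CREATE TABLE events (id bigint PRIMARY KEY, payload text,
                     received_at timestamptz NOT NULL DEFAULT now());
CREATE SUBSCRIPTION events_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION events_pub;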
>
> While I can't share the code, I am happy to share the experience I have gained from my part in working on these things. As you've probably recently seen with the wiki article I wrote on physical/logical failover interop.
>
> You're free to take information like this and use it in wiki articles too.
>
> Right now I won't be able to launch into writing big patches for these things, but I'll do my best to share what I can and review things.
>
> > This seems like a wild idea. But really, we have a super expensive NVMe drives for OLTP workload. And use this devices to store buffer for data to be dumped into MapReduce\YT analytical system.
>
> It's not a wild idea at all, as noted above.
>
> In pglogical3 we already support streaming decoded WAL data to alternative writer downstreams including RabbitMQ and Kafka via writer plugins.

Yes, Yandex.Cloud Transfer Manager supports this too. But it has to be resynced after a physical failover. And the internal installation of YC has mandatory drills: a few times a month one datacenter is disconnected and failover happens for thousands of DBs.

Thank you for your input. Probably I'll put some effort into loading missing WAL as a first step towards the bright future :)

Best regards, Andrey Borodin.