Re: Logical archiving

From: Andrey Borodin
Subject: Re: Logical archiving
Date:
Msg-id: 2B44FA4B-7500-4B37-82BD-BFACA20001AD@yandex-team.ru
In reply to: Re: Logical archiving  (Craig Ringer <craig.ringer@enterprisedb.com>)
Responses: Re: Logical archiving  (Craig Ringer <craig.ringer@enterprisedb.com>)
List: pgsql-hackers
Thanks Craig!
Probably, I should rather ask in your nearby thread about logical replication; it just seemed to me that logical
archiving is a somewhat small, independent piece of functionality...

> On 7 Dec 2020, at 08:05, Craig Ringer <craig.ringer@enterprisedb.com> wrote:
>
> Reply follows inline. I addressed your last point first, so it's out of order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > If OLAP cannot consume data fast enough - we are out of space due to repl slot.
>
> There is a much simpler solution to this than logical PITR.
>
> What we should be doing is teaching xlogreader how to invoke the restore_command to fetch archived WALs for decoding.

> Replication slots already have a WAL retention limit, but right now when that limit is reached the slot is
> invalidated and becomes useless, it's effectively dropped. Instead, if WAL archiving is enabled, we should leave the
> slot as valid. If a consumer of the slot needs WAL that no longer exists in pg_wal, we should have the walsender
> invoke the restore_command to read the missing WAL segment, decode it, and remove it again.
>
> This would not be a technically difficult patch, and it's IMO one of the more important ones for improving logical
> replication.
Currently we have restore_command in the regular config, not in recovery.conf, so it probably should not be a very big
deal to implement this.
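For illustration only, the kind of setting a walsender could reuse (restore_command is a regular postgresql.conf GUC since PostgreSQL 12; the archive path below is a made-up example, not a recommendation):

```
# postgresql.conf fragment (illustrative; the archive path is hypothetical)
restore_command = 'cp /mnt/wal_archive/%f "%p"'
```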
>
> > I was discussing problems of CDC with scientific community and they asked this simple question: "So you have
> > efficient WAL archive on a very cheap storage, why don't you have a logical archive too?"
>
> I've done work in this area, as has Petr (CC'd).
>
> In short, logical archiving and PITR is very much desirable, but we're not nearly ready for it yet and we're missing
> a lot of the foundations needed to make it really useful.
>
> IMO the strongest pre-requisite is that we need integrated DDL capture and replication in Pg. While this could be
> implemented in the publisher/subscriber logic for logical replication, it would make much more sense (IMO) to make it
> easier to feed DDL events into any logical replication output plugin.
>
> pglogical3 (the closed one) has quite comprehensive DDL replication support. Doing it is not simple though - there
> are plenty of complexities:
>
> * Reliably identifying the target objects and mapping them to replication set memberships for DML-replication
> * Capturing, replicating and managing the search_path and other DDL execution context (DateStyle and much more)
>   reliably
> * Each statement type needs specific logic to indicate whether it needs DDL replication (and often filter functions
>   since we have lots of sub-types where some need replication and some don't)
> * Handling DDL affecting global objects in pg_global correctly, like those affecting roles, grants, database
>   security labels etc. There's no one right answer for this, it depends on the deployment and requires the user to
>   cooperate.
> * Correct handling of transactions that mix DDL and DML (mostly only an issue for multimaster).
> * Identifying statements that target a mix of replicated and non-replicated objects and handling them appropriately,
>   including for CASCADEs
> * Gracefully handling DDL statements that mix TEMPORARY and persistent targets. We can do this ok for DROPs but it
>   still requires care. Anything else gets messier.
> * Lack of hooks into table rewrite operations and the extremely clumsy and inefficient way logical decoding
>   currently exposes decoding of the temp-table data during decoding of rewrites means handling table-rewriting DDL
>   is difficult and impractical to do correctly. In pglogical we punt on it entirely and refuse to permit DDL that
>   would rewrite a table except where we can prove it's reliant only on immutable inputs so we can discard the
>   upstream rewrite and rely on statement replication.
> * As a consequence of the above, reliably determining whether a given statement will cause a table rewrite.
> * Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
> * Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing something sensible for truncate cascade.
> * Probably more I've forgotten
>
> If we don't handle these, then any logical change-log archives will become largely useless as soon as there's any
> schema change.
>
> So we kind of have to solve DDL replication first IMO.
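To make the per-statement-type filtering point concrete, here is a toy sketch in plain Python. Everything here (the rule table, the names, the safe default) is invented for illustration; it is not PostgreSQL or pglogical code.

```python
# Illustrative sketch: per-statement-type rules for whether a DDL command
# needs replication and whether it may force a table rewrite. The rule
# table is a toy subset, not an exhaustive classification.
from typing import NamedTuple

class DdlRule(NamedTuple):
    replicate: bool    # should this statement be captured for replication?
    may_rewrite: bool  # can it force a full table rewrite?

DDL_RULES = {
    "CREATE TABLE":      DdlRule(replicate=True,  may_rewrite=False),
    "ALTER TABLE":       DdlRule(replicate=True,  may_rewrite=True),
    "CREATE INDEX":      DdlRule(replicate=True,  may_rewrite=False),
    "CREATE TEMP TABLE": DdlRule(replicate=False, may_rewrite=False),  # temp: skip
    "VACUUM":            DdlRule(replicate=False, may_rewrite=False),  # maintenance
}

def classify(tag: str) -> DdlRule:
    """Look up the rule for a command tag; unknown tags get the safe default
    of replicating and assuming a possible rewrite."""
    return DDL_RULES.get(tag, DdlRule(replicate=True, may_rewrite=True))

# A statement that may rewrite the table has to be rejected or handled
# specially, as pglogical does when the rewrite cannot be proven immutable.
rule = classify("ALTER TABLE")
if rule.replicate and rule.may_rewrite:
    print("needs rewrite-safety check before replication")
```

In reality each tag needs its own filter function (many sub-forms of ALTER TABLE don't rewrite at all), which is exactly the complexity the list above describes.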
>
> Some consideration is also required for metadata management. Right now relation and type metadata has
> session-lifetime, but you'd want to be able to discard old logical change-stream archives and have the later ones
> still be usable. So we'd need to define some kind of restartpoint where we repeat the metadata, or we'd have to
> support externalizing the metadata so it can be retained when the main change archives get aged out.
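The restartpoint idea can be sketched roughly like this (illustrative Python; the message shapes and the interval are invented, not an actual logical decoding protocol):

```python
# Illustrative sketch: a change-stream writer that repeats full relation
# metadata at fixed "restartpoints", so older archive segments can be
# discarded while later ones remain independently decodable.
RESTART_EVERY = 3  # changes between metadata restartpoints (toy value)

def emit_stream(metadata, changes):
    """Interleave full metadata messages into the change stream."""
    out = []
    for i, change in enumerate(changes):
        if i % RESTART_EVERY == 0:
            out.append(("METADATA", metadata))  # restartpoint: repeat schema info
        out.append(("CHANGE", change))
    return out

stream = emit_stream({"rel": "t", "cols": ["id", "v"]}, ["c1", "c2", "c3", "c4"])
# A reader that starts at any METADATA message can decode everything after it
# without needing the earlier, possibly aged-out, segments.
```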
>
> We'd also need to separate the existing apply worker into a "receiver" and "apply/writer" part, so the wire-protocol
> handling isn't tightly coupled with the actual change apply code, in order to make it possible to actually consume
> those archives and apply them to the database. In pglogical3 we did that by splitting them into two processes,
> connected by a shm_mq. Originally the process split was optional and you could run a combined receiver/writer
> process without the shm_mq if you wanted, but we quickly found it difficult to reliably handle locking issues etc
> that way so the writers all moved out-of-process.
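The split can be illustrated with a toy in-process analogue (Python threads and a queue standing in for the two processes and the shm_mq; all names and message shapes are invented):

```python
# Illustrative sketch (not pglogical code): a "receiver" that owns the wire
# protocol and a "writer" that only applies changes, connected by a queue
# standing in for the shm_mq.
import queue
import threading

applied = []  # stands in for the downstream database

def receiver(q, wire_messages):
    """Parse 'wire' messages and forward plain change records to the writer."""
    for msg in wire_messages:
        q.put(("apply", msg))  # decoded change record
    q.put(("stop", None))      # end-of-stream marker

def writer(q):
    """Apply change records; knows nothing about the wire protocol, so the
    same writer could later be fed by a 'logical archive' reader instead."""
    while True:
        kind, payload = q.get()
        if kind == "stop":
            break
        applied.append(payload)

q = queue.Queue()
w = threading.Thread(target=writer, args=(q,))
w.start()
receiver(q, ["INSERT t (1)", "INSERT t (2)"])
w.join()
```

The point of the decoupling is exactly what the next paragraph describes: the writer's input no longer has to come from a live walsender connection.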
>
> That was done mainly to make it possible to support parallelism in logical decoding apply. But we also have the
> intention of supporting an alternative reader process that can ingest "logical archives" and send them to the writer
> to apply them, as if they'd been received from the on-wire stream. That's not implemented at this time though. It'd
> be useful for a number of things:
>
> * PITR-style logical replay and recovery
> * Ability to pre-decode a txn once on the upstream then send the buffered protocol-stream to multiple subscribers,
>   saving on logical decoding and reorder buffering overheads and write-multiplication costs
> * ability to ingest change-streams generated by non-postgres sources so we could support streaming foreign-data
>   ingestion, streaming OLAP and data warehousing, etc
>
> To make logical PITR more useful we'd also want to be a bit more tolerant of schema divergence, though that's not
> overly hard to do:
> * fill defaults for downstream columns if no value is present for the column in the upstream row and the downstream
>   column is nullable or has a default (I think built-in logical rep does this one already)
> * ignore values for columns in upstream data if the downstream table lacks the column and the upstream value is null
> * optionally allow apply to be configured to ignore non-null data in upstream columns that're missing on downstream
> * optionally allow apply to be configured to drop rows on the floor if the downstream table is missing
> * policies for handling data conflicts like duplicate PKs
> and we'd probably want ways to filter the apply data-stream to apply changes for only a subset of tables, rows, etc
> at least in a later version.
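The apply-side tolerance rules can be sketched like this (illustrative Python; the schema representation and the policy flag are invented for the example):

```python
# Illustrative sketch of divergence-tolerant apply: map an upstream row onto
# a downstream table whose column set may differ.
def apply_row(upstream_row, downstream_cols, defaults,
              ignore_missing_nonnull=False):
    """Return the tuple to apply downstream, or raise on unresolvable divergence.

    upstream_row:    dict of column -> value from the decoded change
    downstream_cols: columns that exist downstream
    defaults:        downstream column -> default (used when upstream lacks it)
    """
    result = {}
    for col in downstream_cols:
        if col in upstream_row:
            result[col] = upstream_row[col]
        elif col in defaults:
            result[col] = defaults[col]  # rule 1: fill the downstream default
        else:
            result[col] = None  # assumes nullable; real code must verify that
    for col, val in upstream_row.items():
        if col not in downstream_cols and val is not None:
            # rule 2 drops null values silently; non-null needs a policy
            if not ignore_missing_nonnull:
                raise ValueError(f"non-null value for missing column {col!r}")
    return result

row = apply_row({"id": 1, "extra": None}, ["id", "ts"], {"ts": "now"})
# 'extra' is null upstream and absent downstream, so it is silently dropped;
# 'ts' is filled from the downstream default.
```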
>
> None of this is insurmountable. Most or all of the DDL replication support and divergence-tolerance stuff is already
> done in production deployments using pglogical3 and bdr3.
I really like this wording for the "divergence-tolerance" stuff; it captures the problems I want to solve. I believe
it's somewhat orthogonal to the other issues.

>
> While I can't share the code, I am happy to share the experience I have gained from my part in working on these
> things. As you've probably recently seen with the wiki article I wrote on physical/logical failover interop.
>
> You're free to take information like this and use it in wiki articles too.
>
> Right now I won't be able to launch into writing big patches for these things, but I'll do my best to share what I
> can and review things.
>
> > This seems like a wild idea. But really, we have super expensive NVMe drives for the OLTP workload, and use these
> > devices to store a buffer for data to be dumped into a MapReduce\YT analytical system.
>
> It's not a wild idea at all, as noted above.
>
> In pglogical3 we already support streaming decoded WAL data to alternative writer downstreams including RabbitMQ and
> Kafka via writer plugins.
Yes, Yandex.Cloud Transfer Manager supports it too. But it has to be resynced after physical failover. And internal
installations of YC have mandatory drills: a few times a month one datacenter is disconnected and failover happens for
thousands of DBs.

Thank you for your input. Probably, I'll put some effort into loading missing WAL as a first step towards a bright
future :)

Best regards, Andrey Borodin.

