Catalog/Metadata consistency during changeset extraction from wal

From: Andres Freund
Subject: Catalog/Metadata consistency during changeset extraction from wal
Msg-id: 201206211341.25322.andres@2ndquadrant.com
Replies: Re: Catalog/Metadata consistency during changeset extraction from wal  (Florian Pflug <fgp@phlo.org>)
Re: Catalog/Metadata consistency during changeset extraction from wal  (Florian Pflug <fgp@phlo.org>)
Re: Catalog/Metadata consistency during changeset extraction from wal  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Catalog/Metadata consistency during changeset extraction from wal  (Amit Kapila <amit.kapila@huawei.com>)
Re: Catalog/Metadata consistency during changeset extraction from wal  (Simon Riggs <simon@2ndquadrant.com>)
Re: Catalog/Metadata consistency during changeset extraction from wal  (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
Hi Robert, Hi all,

Robert and I talked quite a bit about different methods of providing enough 
information to extract tuples from wal. I don't think either of us is yet 
really convinced of any individual method, so I want to recap our discussion 
in one email so others can chime in without reading the already huge thread.

I hope I am not misrepresenting Robert's opinion here, but I am sure he will 
correct me if I do ;)

To satisfy the different needs people have for changeset extraction we 
currently think that for each individual tuple extracted from wal we need to 
provide the following information:

a) one or more pieces of tuple data (HeapTupleData structs)
   * INSERT: full new tuple
   * UPDATE: full new tuple, old pkey (optionally the full old tuple)
   * DELETE: old pkey (optionally the full old tuple)
b) the action performed (INSERT|UPDATE|DELETE)
c) the table on which the action was performed
d) access to the table structure (names, column types, ...) and support procs 
(*_out functions for the individual columns)

The problem with getting that data is that at the point where we are decoding 
the wal the catalog may have evolved significantly from the state it was in 
when the tuple was put into the wal.
We can extract a) and b) without any problems (let's not talk about that 
here), but we don't necessarily know how to make sense of the data, because a 
HeapTuple cannot be properly interpreted without knowledge of c) and d).
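To illustrate why c) and d) are essential, here is a deliberately simplified 
Python sketch (using struct layouts, not PostgreSQL's actual tuple format): 
the same raw bytes only decode correctly under the column layout that was 
current when the tuple was written.

```python
import struct

# Illustrative only -- NOT PostgreSQL's tuple format. The point is that
# raw tuple bytes are meaningless without the matching column layout.
old_layout = "<i"    # (id int4) -- layout when the tuple hit the wal
new_layout = "<ii"   # (id int4, bar int4) -- layout after an ALTER TABLE

raw = struct.pack(old_layout, 42)  # the bytes as they sit in the wal

print(struct.unpack(old_layout, raw))  # (42,) -- correct interpretation
try:
    struct.unpack(new_layout, raw)     # decoding with the *current* catalog
except struct.error as e:
    print("wrong layout:", e)          # fails -- buffer size mismatch
```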

I am of the opinion that solving c) is basically equivalent to solving d), 
because the wal only contains the (pg_database.oid, pg_tablespace.oid, 
pg_class.relfilenode) triple of the table and not the pg_class.oid. The 
relfilenode is changed by operations that rewrite the table, like ALTER TABLE 
... ADD COLUMN ... DEFAULT ...; TRUNCATE; CLUSTER; and some others.

A single transaction can contain tuples for different relfilenodes and with 
different columns:

CREATE TABLE foo(id serial primary key, data text);
BEGIN;
INSERT INTO foo ...;
TRUNCATE foo;
INSERT INTO foo ...; -- same structure, different relfilenode

ALTER TABLE foo ADD COLUMN bar text;
INSERT INTO foo ...; -- same relfilenode, different table structure

ALTER TABLE foo ADD COLUMN zaphod text DEFAULT '';
INSERT INTO foo ...; -- different relfilenode, different table structure
COMMIT;

There are several additional complex scenarios.

In http://archives.postgresql.org/message-id/201206192023.20589.andres@2ndquadrant.com 
I listed which options I see for reaching that goal.

A refined version of that list:


1.)
Decode on a different, possibly catalog-only, pg instance kept in sync using 
the command trigger infrastructure (but not necessarily user-level defined 
command triggers)

If the command/event trigger infrastructure logs into a system-catalog table, 
keeping the catalog in the correct state is relatively easy. When 
replaying/converting a reassembled transaction, every time an INSERT into 
that system table happens, the contained DDL gets performed.
The locking on the generating side takes care of the concurrency aspects.
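A minimal sketch of that replay loop, as a Python simulation; the table name 
"pg_ddl_log" and the whole API are made up for illustration:

```python
# Hypothetical sketch of option 1's replay loop -- none of these names
# reflect actual PostgreSQL code.
DDL_LOG_TABLE = "pg_ddl_log"   # assumed name for the DDL log catalog table

def replay_transaction(changes, execute_ddl, apply_change):
    """Apply a reassembled transaction's changes in commit order.

    An INSERT into the DDL log table means the logged command is executed
    first, so the catalog matches what later row changes expect.
    """
    for table, payload in changes:
        if table == DDL_LOG_TABLE:
            execute_ddl(payload)          # keep the decoding catalog in sync
        else:
            apply_change(table, payload)  # ordinary decoded row change

# Usage: DDL interleaves with row changes exactly where it happened.
applied = []
replay_transaction(
    [("foo", "row1"),
     (DDL_LOG_TABLE, "ALTER TABLE foo ADD COLUMN bar text"),
     ("foo", "row2")],
    execute_ddl=lambda sql: applied.append(("ddl", sql)),
    apply_change=lambda t, p: applied.append(("row", p)),
)
print(applied)
# [('row', 'row1'), ('ddl', 'ALTER TABLE foo ADD COLUMN bar text'), ('row', 'row2')]
```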

Advantages:
* minimal overhead (space, performance)
* allows additional tables/indexes/triggers if you take care with oid 
allocation
* easy transactionally correct catalog behaviour
* the decoding instance can be used to store all data in a highly efficient 
manner (no decoding, no full detoasting, ...)
* the decoding instance is fully writable without problems if you don't 
generate conflicts (separate tables, non-overlapping writes, whatever)
* implementable in a pretty unintrusive way

Disadvantages:
* the table structure of replicated tables needs to be exactly the same
* the type definitions + support procs need to be similar enough to read the 
data
* error checking of the above isn't easy but probably possible
* full version/architecture compatibility required
* a proxy instance is required even if you want to replicate into some other 
system/architecture/version

2.)
Keep the decoding site up to date by replicating the catalog via normal HS 
recovery mechanisms.

Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no command trigger support required that can keep a catalog, including 
oids, in sync

Disadvantages:
* driving the catalog recovery that way requires somewhat intricate code, as 
it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough information 
to be queryable after a crash/shutdown
* some complex logic/low-level fudging required to keep the transactional 
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will only ever be readable

3)
Multi-Versioned catalog

Below are two possible implementation strategies for that concept

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that 
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back 
to textual versions on mismatch

Disadvantages:
* decoding probably has to happen on the master, which might not be what 
people want performance-wise

3a)
Change the system catalogs to be versioned

Advantages:
* catalog access is easy
* might be interesting for other users

Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off

3b)
Ensure that enough information in the catalog remains by fudging the xmin 
horizon. Then reassemble an appropriate snapshot to read the catalog as the 
tuple in question saw it.
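To make the snapshot reassembly concrete, here is a deliberately simplified 
Python sketch of MVCC-style visibility against a reassembled snapshot 
(snap_xmin, snap_xmax, set of in-progress xids). It assumes every xid below 
snap_xmax that is not in the in-progress set actually committed; the real 
rules also consult the clog and handle subtransactions, hint bits, and more.

```python
def tuple_visible(snapshot, tuple_xmin, tuple_xmax=None):
    """Simplified visibility of a catalog tuple under a reassembled snapshot.

    snapshot = (snap_xmin, snap_xmax, xip): xids >= snap_xmax or listed in
    xip were still running when the snapshot was taken. snap_xmin would only
    serve as a fast-path lower bound in the real rules, so this sketch
    ignores it. Assumes every other xid committed (PostgreSQL checks clog).
    """
    snap_xmin, snap_xmax, xip = snapshot

    def done_for_snapshot(xid):
        # Transaction finished from this snapshot's point of view?
        return xid < snap_xmax and xid not in xip

    if not done_for_snapshot(tuple_xmin):
        return False   # inserting transaction not yet visible
    if tuple_xmax is not None and done_for_snapshot(tuple_xmax):
        return False   # deleting transaction already visible
    return True

snap = (10, 20, {15})              # xid 15 in progress, 20+ in the future
print(tuple_visible(snap, 5))      # True:  inserted well before the snapshot
print(tuple_visible(snap, 15))     # False: inserter still running
print(tuple_visible(snap, 5, 15))  # True:  deleter still running
print(tuple_visible(snap, 5, 12))  # False: deleted before the snapshot
```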

Advantages:
* should be implementable with low impact to general code

Disadvantages:
* requires some complex code for assembling snapshots
* it might be hard to guarantee that we always have enough information to 
reassemble a snapshot (subxid overflows ...)
* impacts vacuum if replication to some site is slow

4.)
Log enough information in the walstream to make decoding possible using only 
the walstream.

Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* it's possible to make this architecture-independent

Disadvantages:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the wal because at least all the catalog information 
needs to be logged in a transactional manner repeatedly
* misuses wal far more than other methods
* significant new complexity in somewhat critical code paths (heapam.c)
* insanely high space overhead if the decoding is to be architecture-independent

5.)
The actually good idea. Yours?


-----

I think 3a) is not likely to fly, and I think 4) is too complex, although 
Robert isn't convinced of the latter argument.

In my opinion either 3b) or 1) is our best option, because they seem to 
support most of the use cases without huge costs in complexity and 
runtime/space for users not using changeset extraction. 3b) seems to be 
preferable because imo it's the most flexible choice and doesn't require a 
second instance.

Any arguments against/for those?

I am going to talk about implementing 3b) in a separate email in a bit.

Greetings,

Andres


-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

