Re: Catalog/Metadata consistency during changeset extraction from wal

From: Andres Freund
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date:
Msg-id: 201206221142.10603.andres@2ndquadrant.com
In reply to: Re: Catalog/Metadata consistency during changeset extraction from wal (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
Hi,

On Friday, June 22, 2012 08:48:41 AM Simon Riggs wrote:
> On 21 June 2012 12:41, Andres Freund <andres@2ndquadrant.com> wrote:
> > 2.)
> > Keep the decoding site up to date by replicating the catalog via normal
> > HS recovery mechanisms.
> > 
> > Advantages:
> > * most of the technology is already there
> > * minimal overhead (space, performance)
> > * no danger of out of sync catalogs
> > * no need for command trigger support that can keep a catalog in sync,
> > including oids
> > 
> > Disadvantages:
> > * driving the catalog recovery that way requires somewhat intricate
> > code as it needs to be done in lockstep with decoding the wal-stream
> > * requires an additional feature to guarantee HS always has enough
> > information to be queryable after a crash/shutdown
> > * some complex logic/low-level fudging required to keep the transactional
> > behaviour sensible when querying the catalog
> > * full version/architecture compatibility required
> > * the decoding site will only ever be read-only
> 
> My initial reaction was "this won't work", but that requires
> qualification since this is a complex topic: you can use this
> approach as long as you realise that the catalog it gives can never be
> rewound.
Well, only as far back as the min recovery point has been advanced. That's
advanced less frequently than we apply xlog records.
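
To illustrate why those two diverge (a simplified sketch, not the actual
xlog.c code): during recovery the control file's min recovery point is only
pushed forward when a data page with a newer LSN has to be written out, not
once per applied record.

    #include "postgres.h"
    #include "access/xlogdefs.h"        /* XLogRecPtr */
    #include "storage/bufpage.h"        /* Page, PageGetLSN */

    static XLogRecPtr minRecoveryPoint; /* persisted in pg_control */

    /* Simplified stand-in for what UpdateMinRecoveryPoint() achieves. */
    static void
    flush_page_during_recovery(Page page)
    {
        XLogRecPtr page_lsn = PageGetLSN(page);

        /*
         * Before the page reaches disk, pg_control has to promise that a
         * later crash recovery will replay at least up to this LSN again.
         * This is the only point where the value moves forward.
         */
        if (page_lsn > minRecoveryPoint)
        {
            minRecoveryPoint = page_lsn;
            /* ... write pg_control ... */
        }

        /* ... write the page itself ... */
    }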

> So the generic approach to "construct me a catalog as of this LSN"
> would need to start with a base backup of the catalog and then roll
> forward to the appropriate LSN. Which means a generic user of this
> approach would need to be able to construct an initial catalog using a
> PITR.

> Constructing a decoding site requires you to
> a) take a partial base backup of the catalog
> b) apply WAL records to bring that forwards to the correct LSN, which
> would require some alteration of the recovery code to skip the files
> missing in a)
> 
> So taking the approach of a decoding site means we have to modify
> recovery code, and even when we do that we still end up with a
> difficult-to-deploy option in the real world. Difficult to deploy
> because we need a whole new instance of Postgres, plus we need all of
> the WAL files, which could easily be impractical.
> 
> The overall approach is good, but the architecture is wrong. What we
> need is a "catalog base backup" and a means of rolling forward to the
> appropriate LSN. Rolling forward using WAL is too bulky, so we need a
> separate log of DDL changes to the catalog. So what we need is a
> "catalog base backup" plus a "ddl-log".
The idea was to store the applycache to disk every time UpdateMinRecoveryPoint
is called. That way you wouldn't have to rewind, even if the database crashes
or is stopped hard.
But I agree, I don't like the architecture that much either.
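
To sketch what I had in mind (all names below are hypothetical, none of this
exists yet): piggyback on the point where recovery advances the min recovery
point and serialize the decoding state there, so that decoding can resume from
that LSN after a crash instead of having to rewind the catalog.

    /*
     * Hypothetical hook, called whenever the min recovery point advances.
     */
    static void
    apply_cache_checkpoint(ApplyCache *cache, XLogRecPtr min_recovery_point)
    {
        /* write the change buffers of all in-progress transactions to disk */
        ApplyCacheSerialize(cache, min_recovery_point);        /* hypothetical */

        /* remember the LSN from which decoding can safely restart */
        ApplyCacheSetRestartPoint(cache, min_recovery_point);  /* hypothetical */
    }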

> To translate the WAL we maintain a secondary set of catalog tables,
> which only exist for logical replication. These are normal,
> non-versioned tables, but held in a new schema pg_logical or similar.
> One reason why this must be a secondary catalog is to allow the
> translation to take place on the target server, and to allow
> translation of WAL from a prior version of the catalog - so we can
> allow online upgrades across different catalog versions (and possibly
> major versions).
> The secondary catalog is used in place of the normal catalog during
> InitCatalogCache() in the apply process. All the normal caches exist,
> they just point to secondary relations rather than the normal ones.
> When we initialise replication we take a copy of the appropriate
> tables, columns and rows in a catalog-base-backup, using something
> like pg_dump. Overall, this is much smaller than the normal catalog since
> it avoids temp tables, and anything not related to WAL translation.
> 
> On each non-temp change to the database we record changes as SQL in
> the ddl-log, together with the LSN of the change.
> 
> When the number of changes in the ddl-log hits a limit we take a new
> catalog-base-backup. This process is similar to a checkpoint, but much
> less frequent; let's call it a ddl-checkpoint.
> 
> When we start to read WAL logs to translate them, we start by
> truncating/re-bootstrapping and reloading the secondary catalog from
> the base backup. We then apply all changes from the ddl-log (which is
> just a sequence of SQL statements) up until the LSN at the start of
> WAL. The secondary catalog is then an exact copy of the catalog as of
> that LSN.
> 
> As we read through WAL we apply further changes to the secondary catalog
> so it stays in lockstep with the WAL we currently read.
I can't see how that's going to fly because the *_out functions use the
syscache and also plain access to catalog tables. We would have to completely
map oids to the alternative catalog.
For one, I think that mapping would involve far too many places (shared
catalogs/relmapper, smgr, fd, syscache, ...). For another, you need to access
those tables in a completely normal fashion from non-recovery backends, which
means that we cannot just have duplicated oids hidden away somewhere.
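
To make the dependency concrete, this is (simplified) the path any decoder has
to take to turn a datum into text; both steps hit the syscache, and the output
function itself may do arbitrary further catalog access:

    #include "postgres.h"
    #include "fmgr.h"                   /* OidOutputFunctionCall() */
    #include "utils/lsyscache.h"        /* getTypeOutputInfo() */

    /* Convert one column of a decoded tuple into its textual form. */
    static char *
    datum_to_cstring(Oid typid, Datum value)
    {
        Oid     typoutput;
        bool    typisvarlena;

        /* syscache lookup in pg_type to find the type's output function */
        getTypeOutputInfo(typid, &typoutput, &typisvarlena);

        /*
         * enum_out(), record_out(), array_out(), ... all do further catalog
         * lookups -- against the normal catalog, with the normal oids.
         */
        return OidOutputFunctionCall(typoutput, value);
    }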


> Constructing the correct catalog seems to be the heart of this
> problem, so it is likely to take a while and look complex. Getting the
> HS initial state was around 50% of the effort in making it all work,
> so I guess its similar here.
Yes.

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

