Re: Catalog/Metadata consistency during changeset extraction from wal
From:        Andres Freund
Subject:     Re: Catalog/Metadata consistency during changeset extraction from wal
Date:        2012-06-22 11:42
Msg-id:      201206221142.10603.andres@2ndquadrant.com
In reply to: Re: Catalog/Metadata consistency during changeset extraction from wal  (Simon Riggs <simon@2ndquadrant.com>)
List:        pgsql-hackers
Hi,

On Friday, June 22, 2012 08:48:41 AM Simon Riggs wrote:
> On 21 June 2012 12:41, Andres Freund <andres@2ndquadrant.com> wrote:
> > 2.)
> > Keep the decoding site up2date by replicating the catalog via normal
> > HS recovery mechanisms.
> >
> > Advantages:
> > * most of the technology is already there
> > * minimal overhead (space, performance)
> > * no danger of out of sync catalogs
> > * no support for command triggers required that can keep a catalog in
> >   sync, including oids
> >
> > Disadvantages:
> > * driving the catalog recovery that way requires some somewhat
> >   intricate code as it needs to be done in lockstep with decoding the
> >   wal-stream
> > * requires an additional feature to guarantee HS always has enough
> >   information to be queryable after a crash/shutdown
> > * some complex logic/low-level fudging required to keep the
> >   transactional behaviour sensible when querying the catalog
> > * full version/architecture compatibility required
> > * the decoding site will only ever be readable
>
> My initial reaction was "this won't work", but that requires
> qualification since this is a complex topic: You can use this approach
> as long as you realise that the catalog it gives can never be rewound.
Well, only as far as the min recovery point has been advanced. That's
advanced less frequently than we apply xlog records.

> So the generic approach to "construct me a catalog as of this LSN"
> would need to start with a base backup of the catalog and then roll
> forward to the appropriate LSN. Which means a generic user of this
> approach would need to be able to construct an initial catalog using a
> PITR.
> Constructing a decoding site requires you to
> a) take a partial base backup of the catalog
> b) apply WAL records to bring that forwards to the correct LSN, which
>    would require some alteration of the recovery code to skip the
>    files missing in a)
>
> So taking the approach of a decoding site means we have to modify
> recovery code, and even when we do that we still end up with a
> difficult-to-deploy option in the real world. Difficult to deploy
> because we need a whole new instance of Postgres, plus we need all of
> the WAL files, which could easily be impractical.
>
> The overall approach is good, but the architecture is wrong. What we
> need is a "catalog base backup" and a means of rolling forward to the
> appropriate LSN. Rolling forward using WAL is too bulky, so we need a
> separate log of DDL changes to the catalog. So what we need is a
> "catalog base backup" plus a "ddl-log".
The idea was to store the applycache to disk every time
UpdateMinRecoveryPoint is called. That way you wouldn't have to scroll
back, even if the database crashes/is stopped hard.

But I agree, I don't like the architecture that much either.

> To translate the WAL we maintain a secondary set of catalog tables,
> which only exist for logical replication. These are normal,
> non-versioned tables, but held in a new schema pg_logical or similar.
> One reason why this must be a secondary catalog is to allow the
> translation to take place on the target server, and to allow
> translation of WAL from a prior version of the catalog - so we can
> allow online upgrades across different catalog versions (and possibly
> major versions).
> The secondary catalog is used in place of the normal catalog during
> InitCatalogCache() in the apply process. All the normal caches exist,
> they just point to secondary relations rather than the normal ones.
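To make that concrete, here is a minimal standalone C sketch, not from
any actual patch, of the kind of oid indirection such a secondary
catalog implies; the oids of the pg_logical copies and the map itself
are invented for illustration:

    /*
     * Illustrative sketch only: a toy mapping from the oids the decoded
     * WAL references to the oids of the copies under pg_logical.  The
     * secondary oids are made up; only the real oids of pg_class,
     * pg_attribute and pg_type appear on the left-hand side.
     */
    #include <stddef.h>
    #include <stdio.h>

    typedef unsigned int Oid;

    typedef struct OidMapEntry
    {
        Oid primary_oid;    /* oid as referenced by the decoded WAL */
        Oid secondary_oid;  /* oid of the copy in schema pg_logical */
    } OidMapEntry;

    static const OidMapEntry oid_map[] =
    {
        {1259, 90001},      /* pg_class     -> hypothetical copy */
        {1249, 90002},      /* pg_attribute -> hypothetical copy */
        {1247, 90003},      /* pg_type      -> hypothetical copy */
    };

    /*
     * Redirect a catalog oid to its secondary relation; oids without a
     * mapping (e.g. shared catalogs) fall through unchanged.
     */
    static Oid
    map_catalog_oid(Oid primary)
    {
        for (size_t i = 0; i < sizeof(oid_map) / sizeof(oid_map[0]); i++)
        {
            if (oid_map[i].primary_oid == primary)
                return oid_map[i].secondary_oid;
        }
        return primary;
    }

    int
    main(void)
    {
        /* a lookup of pg_class during decoding would get redirected */
        printf("1259 -> %u\n", map_catalog_oid(1259));
        return 0;
    }

Every syscache, relcache, relmapper and smgr access done while decoding
would have to pass through something like this, which is where the
trouble starts, as discussed below.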
> When we initialise replication we take a copy of the appropriate
> tables, columns and rows in a catalog-base-backup, using something
> like pg_dump. Overall, this is much smaller than the normal catalog
> since it avoids temp tables, and anything not related to WAL
> translation.
>
> On each non-temp change to the database we record changes as SQL in
> the ddl-log, together with the LSN of the change.
>
> When the number of changes in the ddl-log hits a limit we take a new
> catalog-base-backup. This process is similar to a checkpoint, but much
> less frequent; let's call it a ddl-checkpoint.
>
> When we start to read WAL logs to translate them, we start by
> truncating/re-bootstrapping and reloading the secondary catalog from
> the base backup. We then apply all changes from the ddl-log (which is
> just a sequence of SQL statements) up until the LSN at the start of
> WAL. The secondary catalog is then an exact copy of the catalog as of
> that LSN.
>
> As we read through WAL we apply further changes to the secondary
> catalog so it stays in lockstep with the WAL we are currently reading.
I can't see how that's going to fly because the *_out functions use the
syscache and also plain access to catalog tables. We would have to
completely map oids to the alternative catalog. For one, I think that
mapping would involve far too many places (shared catalogs/relmapper,
smgr, fd, syscache, ...). For another, you need to access those tables
in a completely normal fashion from non-recovery backends, which means
that we cannot just have duplicated oids hidden away somewhere.

> Constructing the correct catalog seems to be the heart of this
> problem, so it is likely to take a while and look complex. Getting the
> HS initial state was around 50% of the effort in making it all work,
> so I guess it's similar here.
Yes.

Greetings,

Andres
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
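For concreteness, a toy standalone model of the catalog-base-backup
plus ddl-log roll-forward described above; the record layout and
apply_sql() are assumptions made for illustration, not anything that
exists in the tree:

    /*
     * Illustrative sketch only: rebuilding the secondary catalog by
     * replaying a hypothetical ddl-log, ordered by LSN, up to the LSN
     * at which WAL decoding is to start.  DDLLogRecord and apply_sql()
     * are invented stand-ins.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    typedef struct DDLLogRecord
    {
        XLogRecPtr  lsn;    /* LSN at which the catalog change happened */
        const char *sql;    /* DDL statement recorded for the change */
    } DDLLogRecord;

    /* stand-in for executing DDL against the secondary catalog */
    static void
    apply_sql(const char *sql)
    {
        printf("applying: %s\n", sql);
    }

    /*
     * After reloading the secondary catalog from the latest
     * catalog-base-backup, roll the ddl-log forward to target_lsn.
     */
    static void
    rollforward_ddl_log(const DDLLogRecord *log, size_t nrecords,
                        XLogRecPtr target_lsn)
    {
        for (size_t i = 0; i < nrecords; i++)
        {
            if (log[i].lsn > target_lsn)
                break;      /* log is ordered by LSN, nothing further applies */
            apply_sql(log[i].sql);
        }
    }

    int
    main(void)
    {
        const DDLLogRecord log[] =
        {
            {0x1000, "ALTER TABLE t ADD COLUMN c int"},
            {0x2000, "CREATE INDEX t_c ON t (c)"},
            {0x3000, "DROP TABLE t2"},
        };

        /* decoding starts at 0x2500: only the first two records apply */
        rollforward_ddl_log(log, sizeof(log) / sizeof(log[0]), 0x2500);
        return 0;
    }

A real implementation would additionally have to execute the recorded
SQL against relations whose oids differ from those the decoded WAL
references, which is exactly the mapping problem raised above.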