Re: logical changeset generation v4 - Heikki's thoughts about the patch state

From: Andres Freund
Subject: Re: logical changeset generation v4 - Heikki's thoughts about the patch state
Msg-id: 20130125011609.GA15706@awork2.anarazel.de
In response to: Re: logical changeset generation v4 - Heikki's thoughts about the patch state (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: logical changeset generation v4 - Heikki's thoughts about the patch state (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
Hi!

On 2013-01-24 13:27:00 -0500, Robert Haas wrote:
> On Thu, Jan 24, 2013 at 6:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> Before getting bogged down in technical commentary, let me say this
> very clearly: I am enormously grateful for your work on this project.
> Logical replication based on WAL decoding is a feature of enormous
> value that PostgreSQL has needed for a long time, and your work has
> made that look like an achievable goal.  Furthermore, it seems to me
> that you have pursued the community process with all the vigor and
> sincerity for which anyone could ask.  Serious design concerns were
> raised early in the process and you made radical changes to the design
> which I believe have improved it tremendously, and you've continued to
> display an outstanding attitude at every phase of this process about
> which I can't say enough good things.

Very much appreciated. Especially as I can echo your feeling of not only
having positive feelings about the process ;)

> Now, the bad news is, I don't think it's very reasonable to try to
> commit this to 9.3.  I think it is just too much stuff too late in the
> cycle.  I've reviewed some of the patches from time to time but there
> is a lot more stuff and it's big and complicated and it's not really
> clear that we have the interface quite right yet, even though I think
it's also clear that we are a lot closer than we were.  I don't
> want to be fixing that during beta, much less after release.

It pains me to admit that you have a point there.

What I am afraid of, though, is that it will basically go on like this in
the next commitfests:
* 9.4-CF1: no "serious" reviewer comments because they are busy doing release work
* 9.4-CF2: all are relieved that the release is over and a bit tired
* 9.4-CF3: first deeper review, some more complex restructuring required
* 9.4-CF4: too many changes to commit.

If you look at the development of the feature, after the first prototype
and the resulting design changes nobody with decision power had a more
than cursory look at the proposed interfaces. That's very, very, very
understandable, you all are busy people and the patch & the interfaces
are complex so it takes noticeable amounts of time, but it unfortunately
doesn't help in getting an acceptable interface nailed down.

The problem with that is not only that it sucks huge amounts of energy
out of me and others, but also that it's very hard to really build the
layers/users above changeset extraction without being able to rely on
the interface and semantics. So we never get to the actual benefits
:(, and we don't get the users people require for the feature to be
committed.

So far, the only really effective way of getting people to comment on
patches in this state & complexity is the threat of an upcoming commit
because of the last commitfest :(

I honestly don't know how to go about this...

> > I tried very, very hard to get the basics of the design & interface
> > solid. Which obviously doesn't mean I am succeeding - luckily not being
> > superhuman after all ;). And I think that's very much where input is
> > desperately needed and where I failed to raise enough attention. The
> > "output plugin" interface followed by the walsender interface is what
> > needs to be most closely vetted.
> > Those are the permanent, user/developer exposed UI and the one we should
> > try to keep as stable as possible.
> >
> > The output plugin callbacks are defined here:
> >
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/include/replication/output_plugin.h;hb=xlog-decoding-rebasing-cf4
> > To make it more agnostic of the technology to implement changeset
> > extraction we possibly should replace the ReorderBuffer(TXN|Change)
> > structs being passed by something more implementation agnostic.
> >
> > walsender interface:
> >
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/backend/replication/repl_gram.y;hb=xlog-decoding-rebasing-cf4
> > The interesting new commands are:
> > 1) K_INIT_LOGICAL_REPLICATION NAME NAME
> > 2) K_START_LOGICAL_REPLICATION NAME RECPTR plugin_options
> > 3) K_FREE_LOGICAL_REPLICATION NAME
> >
> > 1 & 3 allocate (respectively free) the permanent state associated with
> > one changeset consumer whereas START_LOGICAL_REPLICATION streams out
> > changes starting at RECPTR.
> 
> Forgive me for not having looked at the patch, but to what extent is
> all this, ah, documented?

There are several mails on -hackers where I ask for input on whether
that interface is what people want but all the comments have been from
non-core pg people, although mildly favorable.

I couldn't convince myself of writing real low-level documentation
instead of just the example code I needed for testing anyway and some
more higher-level docs before I had input from that side. Perhaps that
was a mistake.

So, here's a slightly less quick overview of the walsender interface:

Whenever a new replication consumer wants to stream data we need to make
sure on the primary that the data can be provided gapless, even across
disconnects, crashes et al.
The permanent state associated with this is currently called a
"replication slot".

$ psql "port=5440 dbname=postgres replication=1"
postgres=# INIT_LOGICAL_REPLICATION 'bdr-whatever-1' 'test_decoding';
 replication_id | consistent_point | snapshot_name |    plugin
----------------+------------------+---------------+---------------
 bdr-whatever-1 | 0/3E8DFA08       | 000F54F1-1    | test_decoding
(1 row)

So now we have allocated a permanent slot identified by the name
'bdr-whatever-1'. It also automatically exported the snapshot
'000F54F1-1' that can be imported into another transaction, e.g. to
consistently dump an initial snapshot of the data.
The information returned in the 'consistent_point' column tells us that
we will be able to return all data from that LSN onwards.
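
As an aside, such an exported snapshot can be adopted by a second session
using SET TRANSACTION SNAPSHOT (a stock 9.2 feature). This is only a
sketch - it assumes a plain second psql connection, reuses the snapshot
name from the example above, and must run before the exporting
transaction ends:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- adopt the snapshot exported by INIT_LOGICAL_REPLICATION above
SET TRANSACTION SNAPSHOT '000F54F1-1';
-- reads in this transaction now see exactly the state as of the
-- consistent_point, suitable for dumping an initial copy of the data
COMMIT;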

That replication slot can *only* be used for replicating changes out of
the database postgres and with the plugin 'test_decoding' (a contrib
module).

That slot will persist across restarts and everything until somebody
issues a
FREE_LOGICAL_REPLICATION 'bdr-whatever-1'.

To start streaming out changes the command
postgres=# START_LOGICAL_REPLICATION 'bdr-whatever-1' 0/3E8DFA08;
WARNING:  Starting logical replication
unexpected PQresultStatus: 8
Time: 76.346 ms

is used. Unfortunately psql isn't a suitable consumer as it cannot deal
with the unrequested copy, but that's what we have pg_receivellog for:

$ ~/.../pg_receivellog  -p 5440 -d postgres --slot bdr-whatever-1 -f - --start -v
pg_receivellog: starting log streaming at 0/0 (slot bdr-whatever-1)
pg_receivellog: initiated streaming

Which will start streaming out changes when we do:
$ psql -h /tmp -p 5440 -U andres postgres
postgres=# CREATE TABLE frak(id serial primary key, data int);
CREATE TABLE
postgres=# INSERT INTO frak (data) SELECT * FROM generate_series(1, 1);
INSERT 0 1

Back to pg_receivellog:

BEGIN 1004786
table "frak_id_seq": INSERT: sequence_name[name]:frak_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f
COMMIT 1004786
pg_receivellog: confirming flush up to 0/3E8F0F30 (slot bdr-whatever-1)
BEGIN 1004787
table "frak": INSERT: id[int4]:1 data[int4]:1
COMMIT 1004787
pg_receivellog: confirming flush up to 0/3E8FCDC0 (slot bdr-whatever-1)


Makes sense so far?


The actual output you see there, the 
BEGIN 1004787
table "frak": INSERT: id[int4]:1 data[int4]:1
COMMIT 1004787
bit, is generated by the test_decoding plugin referenced previously
which has functions like
extern void pg_decode_init(struct LogicalDecodingContext *ctx, bool is_init);
extern bool pg_decode_begin_txn(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
extern bool pg_decode_commit_txn(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
extern bool pg_decode_change(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn, Oid tableoid, ReorderBufferChange *change);

And e.g. begin_txn looks like:

bool
pg_decode_begin_txn(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    TestDecodingData *data = ctx->output_plugin_private;

    ctx->prepare_write(ctx, txn->lsn, txn->xid);
    if (data->include_xids)
        appendStringInfo(ctx->out, "BEGIN %u", txn->xid);
    else
        appendStringInfoString(ctx->out, "BEGIN");
    ctx->write(ctx, txn->lsn, txn->xid);
    return true;
}

As you see, it seems to have somehow gathered options from
somewhere. Those can be specified as optional arguments to
START_LOGICAL_REPLICATION.
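
For illustration only - the authoritative option syntax is whatever
repl_gram.y defines; the option name below is derived from the
include_xids flag in test_decoding and the parenthesized form is an
assumption:

START_LOGICAL_REPLICATION 'bdr-whatever-1' 0/3E8DFA08 ('include-xids', '0');

With include-xids disabled, pg_decode_begin_txn above would emit a plain
BEGIN instead of BEGIN %u.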

> > Btw, there are currently *no* changes to the WAL format at all if
> > wal_level < logical, except that xl_running_xacts are logged more
> > frequently, which obviously could easily be made conditional. Barring
> > bugs of course.
> > The changes with wal_level>=logical aren't that big either imo:
> > * heap_insert, heap_update prevent full page writes from removing their
> >   normal record by using a separate XLogRecData block for the buffer and
> >   the record
> > * heap_delete adds more data (the pkey of the tuple) after the unchanged
> >   xl_heap_delete struct
> > * On changes to catalog tables, the (relfilenode, tid, cmin, cmax) of
> >   the tuple are logged.
> >
> > No changes to mvcc for normal backends at all, unless you count the very
> > slightly changed *Satisfies interface (getting passed a HeapTuple
> > instead of HeapTupleHeader).
> >
> > I am not sure what you're concerned about WRT the on-disk format of the
> > tuples? We are pretty much nailed down on that due to pg_upgrade'ability
> > anyway, and it could be changed from this patch's POV without a problem,
> > the output plugin just sees normal HeapTuples? Or are you concerned
> > about the code extracting them from the xlog records?
>
> Mostly, my concern is that you've accidentally broken something, or
> that your code will turn out to be flaky in ways we can't now predict.

I really think a look or two from experienced enough people should make
the heapam parts safe enough.

The changes basically are like:

 heap_insert(Relation relation, HeapTuple tup, CommandId cid, ...)
        xl_heap_insert xlrec;
        xl_heap_header xlhdr;
        XLogRecPtr  recptr;
-       XLogRecData rdata[3];
+       XLogRecData rdata[4];
        Page        page = BufferGetPage(buffer);
        uint8       info = XLOG_HEAP_INSERT;

+       /*
+        * For the logical replication case we need the tuple even if we're
+        * doing a full page write. We could alternatively store a pointer into
+        * the fpw though.
+        * For that to work we add another rdata entry for the buffer in that
+        * case.
+        */
+       bool        need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL
+           && RelationGetRelid(relation) >= FirstNormalObjectId;
+
+       /* For logical decode we need combocids to properly decode the catalog */
+       if (wal_level >= WAL_LEVEL_LOGICAL && RelationGetRelid(relation) < FirstNormalObjectId)
+           log_heap_new_cid(relation, heaptup);
...
        rdata[1].data = (char *) &xlhdr;
        rdata[1].len = SizeOfHeapHeader;
-       rdata[1].buffer = buffer;
+       rdata[1].buffer = need_tuple_data ? InvalidBuffer : buffer;
        rdata[1].buffer_std = true;
        rdata[1].next = &(rdata[2]);

        /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
        rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
        rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
-       rdata[2].buffer = buffer;
+       rdata[2].buffer = need_tuple_data ? InvalidBuffer : buffer;
        rdata[2].buffer_std = true;
        rdata[2].next = NULL;

+       /*
+        * add record for the buffer without actual content that's removed if
+        * fpw is done for that buffer
+        */
+       if (need_tuple_data)
+       {
+           rdata[2].next = &(rdata[3]);
+
+           rdata[3].data = NULL;
+           rdata[3].len = 0;
+           rdata[3].buffer = buffer;
+           rdata[3].buffer_std = true;
+           rdata[3].next = NULL;
+       }
Both the wal_level >= logical && XXX checks are now nicely encapsulated,
but this shows the complexity of what's being done better...

That's about all the changes that are done to heapam.c. Well, the same is
done for update, multi_insert, and delete as well, but...

> My only really specific concern at this point is about the special
> treatment of catalog tables.  We've never done anything like that
> before, and it feels like a bad idea.  In particular, the fact that
> you have to WAL-log new information about cmin/cmax really suggests
> that we're committing ourselves to the MVCC infrastructure in a way
> that we weren't previously.

It basically restores the pre-8.3 (?) state where cmin/cmax were
really stored - only that it only does so temporarily instead of
permanently bloating the tables again. It imo pretty closely resembles
what the normal code is doing with combocids, just that the combocid in
this case is slightly more complex because it needs to be looked up over
a longer timeframe.
I thought about simply re-adding cmin/cmax storage for catalog tables,
with some trickery that's not that hard to do (store it similar to oids),
but the impact of that would have been far, far greater.

And the decision of treating only some tables that way? Well, that's a
question of overhead. There simply is no need to do something like that
for tables that aren't required for converting a HeapTuple to the format
the output wants.
From my pov it's somewhat similar to the way we log differently for
temporary, persistent and unlogged tables.

>  There's some category of stuff that our
> MVCC implementation didn't previously require us to persist on disk
> which, after this, it will.  I don't understand exactly where the
> boundaries of that are in terms of future changes we might want to
> make - but I don't like moving the goalposts in that area.

I don't really see a problem there. If we decide to get rid of MVCC in a
fundamental manner, this will be the absolutely, smallest problem of it
all. IMNSHO ;)

Andres

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


