Re: Changeset Extraction v7.1

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Changeset Extraction v7.1
Дата
Msg-id CA+TgmobF6_FzU411MCrBYrUJd6Pf3OFPtBT+CA61TtNt9t+1RA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Changeset Extraction v7.1  (Andres Freund <andres@2ndquadrant.com>)
Ответы Re: Changeset Extraction v7.1  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: Changeset Extraction v7.1  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-hackers
Patch 0001:

+                                errmsg("could not find free replication slot"),

Suggest: all replication slots are in use

+               elog(ERROR, "cannot aquire a slot while another slot
has been acquired");

Suggest: this backend has already acquired a replication slot

Or demote it to Assert().  I'm not really sure why this needs to be
checked in non-assert builds.  I also wonder if we should use the
terminology "attach" instead of "acquire"; that pairs more naturally
with "release".  Then the message, if we want more than an assert,
might be "this backend is already attached to a replication slot".

+       if (slot == NULL)
+       {
+               LWLockRelease(ReplicationSlotCtlLock);
+               ereport(ERROR,
+                               (errcode(ERRCODE_UNDEFINED_OBJECT),
+                                errmsg("could not find replication
slot \"%s\"", name)));
+       }

The error will release the LWLock anyway; I'd get rid of the manual
LWLockRelease, and the braces.  Similarly in ReplicationSlotDrop.

+       /* acquire spinlock so we can test and set ->active safely */
+       SpinLockAcquire(&slot->mutex);
+
+       if (slot->active)
+       {
+               SpinLockRelease(&slot->mutex);
+               LWLockRelease(ReplicationSlotCtlLock);
+               ereport(ERROR,
+                               (errcode(ERRCODE_OBJECT_IN_USE),
+                                errmsg("slot \"%s\" already active", name)));
+       }
+
+       /* we definitely have the slot, no errors possible anymore */
+       slot->active = true;
+       MyReplicationSlot = slot;
+       SpinLockRelease(&slot->mutex);

This doesn't need the LWLockRelease either.  It does need the
SpinLockRelease, but as I think I noted previously, the right way to
write this is probably: SpinLockAcquire(&slot->mutex); was_active =
slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if
(was_active) ereport(). MyReplicatonSlot = slot.

ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
the comment "Provide interlock against concurrent recomputations"
doesn't seem adequate to me.  I guess the idea here is that we regard
ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
ReplicationSlotCtl->data_xmin, but if that's the idea then we only
need to hold the lock during the time when we actually update those
values, not the loop where we compute them.  Also, if that's the
design, maybe they should be part of PROC_HDR *ProcGlobal rather than
here.  It seems weird to have some of the values protected by
ProcArrayLock live in a completely different data structure managed
almost entirely by some other part of the system.

It's pretty evident that what's currently patch #4 (only peg the xmin
horizon for catalog tables during logical decoding) needs to become
patch #1, because it doesn't make any sense to apply this before we do
that.  I'm still not 100% confident in that approach, but maybe I'd
better try to look at it RSN and get confident, because too much of
the rest of what you've got here hangs on that to proceed without it.
Or to put all that another way, if for any reason we decide that the
separate catalog xmin stuff is not viable, the rest of this is going
to need a lot of rework, so we'd better sort that now rather than
later.

With respect to the synchronize-slots-to-disk stuff we're arguing
about on the other thread, I think the basic design problem here is
that you assume that you can change stuff in memory and then change
stuff on disk, without either set of changes being atomic.  What I
think you need to do is making atomic actions on disk correspond to
atomic state changes in memory.  IOW, suppose that creating a slot
consists of two steps: mkdir() + fsync().  Then I think what you need
to do is - do the mkdir().  If it errors out, fine.  If it succeeds,
the mark the slot half-created.  This is just an assignment so it can
done immediately after you learn that mkdir() worked with no risk of
an intervening failure.  Then, try to fsync().  If it errors out, the
slot will get left in the half-created state.  If it works, then
immediately mark the slot as fully created.  Now, when the next guy
comes along and looks at the slot, he can tell what he needs to do.
Specifically, if the slot is half-created, and he wants to do anything
other than remove it, he's got to fsync() it first, and if that errors
out, so be it.  The next access to the slot will merely find it still
half-created and simply try the fsync() yet again.

Alternatively, since nearly everything we're trying to do here is a
two-step operation - do something and then fsync - maybe we have a
more generic fsync-pending flag, and each slot operation checks that
and retries the fsync() if it's set.  But it might be situation
dependent which thing we need to fsync, since there are multiple files
involved.

Broadly, what you're trying to accomplish here is to have something
that is crash-safe but without relying on WAL, so that it can work on
standbys.  If making things crash-safe without WAL were easy to do, we
probably wouldn't have WAL at all, so it stands to reason that there
are going to be some difficulties here.  Making it work reliably is
going to require either inventing some special-purpose type of
write-ahead logging specific to this particular need, or some analogue
of shadow paging, or making sure that every intermediate step is
well-defined and recoverable.  Right now, you're on that last path,
and it's by no means obvious to me that that's the wrong place to be,
but I think there's some work left to be done to get it there.

Calling a slot "old" or "new" looks liable to cause problems.  Maybe
change those names to contain a character not allowed in a slot name,
if we're going to keep doing it that way.

I wonder if it wouldn't be better to get rid of the subdirectories for
the individual slots, and just have a file pg_replslot/$SLOTNAME, or
not.  I know there are later patches that need subdirectories for
their own private data, but they could just create
pg_replslot/$SLOTNAME.dir and put whatever in it they like, without
really implicating this code that much.  The advantage of that is that
there would be fewer intermediate states.  The slot exists if the file
exists, and not if it doesn't.  You still need half-alive and
half-dead until the fsync finishes, but you don't need to worry about
tracking both the state of the directory and the state of the file.
On startup we fsync the containing directory and all of the slot files
we find inside it and refuse to start up if that fails, but once
running filesystem failures only prevent changes; they don't kill the
system.

Patch 0004:

I'm not very confident that PROC_IN_LOGICAL_DECODING is the right way
to go here.  It seems to me that excluding the xmins of backends with
slots from globalxmin consideration so that we can fold the same xmin
in by some other mechanism is kind of strange.  How about letting the
xmins of such backends affect the computation as normal, and then
having one extra xmin that gets folded in that represents the minima
of the xmin of unconnected slots?  When a backend with a slot
disconnects, an on_shmem_exit hook must move that value backwards if
it follows MyPgXact->xmin.

That's all for now...

...Robert



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Mark Kirkwood
Дата:
Сообщение: Re: Why do we let autovacuum give up?
Следующее
От: Heikki Linnakangas
Дата:
Сообщение: Re: Re: [COMMITTERS] pgsql: Compress GIN posting lists, for smaller index size.