Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Petr Jelinek
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date:
Msg-id 6decb65b-aa4f-2b9a-1b53-7246cebe481a@2ndquadrant.com
In response to: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List: pgsql-bugs
Hi,

On 10/10/2020 21:06, Alvaro Herrera wrote:
> On 2020-Sep-30, Tom Lane wrote:
> 
>> After some digging around, I realize that that commit actually
>> resulted in a protocol break.  libpqwalreceiver is expecting to
>> get an additional CommandComplete message after COPY OUT finishes,
>> per libpqrcv_endstreaming(), and it's no longer getting one.
>>
>> (I have not read the protocol document to see if this is per spec;
>> but spec or no, that's what libpqwalreceiver is expecting.)
>>
>> The question that this raises is how the heck did that get past
>> our test suites?  It seems like the error should have been obvious
>> to even the most minimal testing.
> 
> So I think I can answer this now.  Petr, as intellectual author of this
> code I would appreciate your thoughts on this -- you probably understand
> this much better.  Same goes for Peter as committer.
> 
> I think the logical replication feature is missing a detailed
> architectural diagram.
> 
> What is happening is that the tablesync code, which is in charge of
> initial syncing of tables, has two main code paths: the first one is an
> optimization (?) in LogicalRepSyncTableStart() whereby, if the sync takes
> little time, the process exits immediately (via
> finish_sync_worker()) without returning control to the main logical
> replication apply worker code.  In the TAP tests, all tables are pretty
> small, so that optimization always fires.

It's not only about the size of the tables; it's mainly that there is no
write activity, so the main apply worker does not move past the LSN at
which the table sync started. With a bigger table you get at least
something written (running xacts, autovacuum, or whatever) that
eventually moves the LSN forward.

> 
> So in that case we never get control back to walrcv_endstreaming (the
> complainant in Henry's test case), which is why replication never fails
> in the tests.  You can verify this because the final "return slotname"
> line in LogicalRepSyncTableStart has zero coverage.
> 

Correct.
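
To make that concrete for anyone following the thread, here is a rough
plain-libpq sketch of the sequence the walreceiver side goes through
when it ends streaming (my simplification, not the actual
libpqrcv_endstreaming() code); the final PQgetResult() is waiting for
exactly the CommandComplete that went missing:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch only -- not the real libpqrcv_endstreaming().  Send CopyDone,
 * drain the remaining copy data, then read the CommandComplete
 * (PGRES_COMMAND_OK) for the START_REPLICATION command.
 */
static void
end_streaming_sketch(PGconn *conn)
{
    PGresult   *res;
    char       *buf;

    /* Tell the walsender we're done streaming. */
    if (PQputCopyEnd(conn, NULL) < 0)
        fprintf(stderr, "could not send CopyDone: %s", PQerrorMessage(conn));

    /* Drain whatever copy data is still in flight. */
    while (PQgetCopyData(conn, &buf, 0) > 0)
        PQfreemem(buf);

    /* This is the CommandComplete the client expects to see. */
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "unexpected result at end of streaming: %s",
                PQerrorMessage(conn));
    PQclear(res);
}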

> 
> The other code path returns the slot name to ApplyWorkerMain, which
> does the normal walrcv_startstreaming / LogicalRepApplyLoop /
> endstreaming dance.  And whenever that code path is taken, all the
> src/test/subscription tests fail with the missing command tag, and
> they work correctly when the command tag is restored, which is what
> we wanted.
> 
> My proposal at this stage is to remove the optimization in
> LogicalRepSyncTableStart, per 0001, if only because the code becomes
> simpler (0002 reindents code).  With this, we get a coverage increase
> for tablesync.c not only in the final "return slotname" but also for
> process_syncing_tables_for_sync() which is currently uncovered:
> https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268
> 
> However, and this is one reason why I'd welcome Petr/Peter thoughts on
> this, I don't really understand what happens in LogicalRepApplyLoop
> afterwards with a tablesync worker; are we actually doing anything
> useful there, considering that the actual data copy seems to have
> occurred in the CopyFrom() call in copy_table?  In other words, by the
> time we return control to ApplyWorkerMain with a slot name, isn't the
> work all done, and the only thing we need is to synchronize protocol and
> close the connection?
> 

There are two possible states at that point: either tablesync is ahead
(when the main apply lags or nothing is happening on the publication
side) or it's behind the main apply. When tablesync is ahead we are
indeed done and just need to update the state of the table (which is
what the code you removed did, but LogicalRepApplyLoop should do it as
well, just a bit later). When it's behind we need to do catchup for that
table only, which still happens in the tablesync worker. See the
explanation at the beginning of tablesync.c; it probably needs some
small adjustments after the changes in your first patch.
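
In rough pseudo-C, the decision is something like this (just an
illustration of the two states, not the actual
process_syncing_tables_for_sync() code; the names are made up):

#include <inttypes.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

/* Illustration only: compare where the tablesync worker is with where
 * the main apply worker is, and pick one of the two paths above. */
static void
tablesync_after_copy(XLogRecPtr sync_lsn, XLogRecPtr main_apply_lsn)
{
    if (sync_lsn >= main_apply_lsn)
    {
        /* Tablesync is ahead: nothing left to replay for this table,
         * record it as synced and let the worker exit. */
        printf("mark table as synced at %" PRIX64 " and exit\n", sync_lsn);
    }
    else
    {
        /* Tablesync is behind: keep applying changes for this one table
         * until we reach the main apply position, then mark it synced. */
        printf("catch up from %" PRIX64 " to %" PRIX64 " for this table\n",
               sync_lsn, main_apply_lsn);
    }
}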

> 
> Another thing I noticed is that if I crank up the number of rows in the
> new t/100_bugs.pl test case(*), the two sync workers run in sequence -- not
> in parallel.  What's going on with that?  Shouldn't there be two
> simultaneous workers?
>

That's likely because the second replication slot is waiting for the xid
created by the first one to commit or abort while building a consistent
snapshot.

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/


