Re: Minimal logical decoding on standbys

From: Andres Freund
Subject: Re: Minimal logical decoding on standbys
Date:
Msg-id: 20190405231553.njjcu7igoyuk2hx5@alap3.anarazel.de
In reply to: Re: Minimal logical decoding on standbys (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses: Re: Minimal logical decoding on standbys (Amit Khandekar <amitdkhan.pg@gmail.com>)
           Re: Minimal logical decoding on standbys (Craig Ringer <craig@2ndquadrant.com>)
List: pgsql-hackers
Hi,

Thanks for the new version of the patch.  Btw, could you add Craig as a
co-author in the commit message of the next version of the patch? Don't
want to forget him.

On 2019-04-05 17:08:39 +0530, Amit Khandekar wrote:
> Regarding the test result failures, I could see that when we drop a
> logical replication slot at the standby server, then the catalog_xmin of
> the physical replication slot becomes NULL, whereas the test expects it to
> be equal to xmin; and that's the reason a couple of test scenarios are
> failing:
> 
> ok 33 - slot on standby dropped manually
> Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
> done
> not ok 34 - physical catalog_xmin still non-null
> not ok 35 - xmin and catalog_xmin equal after slot drop
> #   Failed test 'xmin and catalog_xmin equal after slot drop'
> #   at t/016_logical_decoding_on_replica.pl line 272.
> #          got:
> #     expected: 2584
> 
> I am not sure what is expected. What actually happens is: the
> physical slot catalog_xmin remains NULL initially, but becomes
> non-NULL after the logical replication slot is created on standby.

That seems like the correct behaviour to me - why would we still have a
catalog xmin if there's no logical slot?


> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index 006446b..5785d2f 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
>      }
>  }
>  
> +void
> +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> +{
> +    int            i;
> +    bool        found_conflict = false;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    if (found_conflict)
> +    {
> +        CHECK_FOR_INTERRUPTS();
> +        /*
> +         * Wait awhile for them to die so that we avoid flooding an
> +         * unresponsive backend when system is heavily loaded.
> +         */
> +        pg_usleep(100000);
> +        found_conflict = false;
> +    }
> +
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData    slotname;
> +        TransactionId slot_xmin;
> +        TransactionId slot_catalog_xmin;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != InvalidOid && s->data.database != dboid)
> +            continue;
> +
> +        SpinLockAcquire(&s->mutex);
> +        slotname = s->data.name;
> +        slot_xmin = s->data.xmin;
> +        slot_catalog_xmin = s->data.catalog_xmin;
> +        SpinLockRelease(&s->mutex);
> +
> +        if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> +        {
> +            found_conflict = true;
> +
> +            ereport(WARNING,
> +                    (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> +                            NameStr(slotname), slot_xmin, xid)));
> +        }
> +
> +        if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> +        {
> +            found_conflict = true;
> +
> +            ereport(WARNING,
> +                    (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> +                            NameStr(slotname), slot_catalog_xmin, xid)));
> +        }
> +
> +
> +        if (found_conflict)
> +        {
> +            elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
> +            LWLockRelease(ReplicationSlotControlLock);    /* avoid deadlock */
> +            ReplicationSlotDropPtr(s);
> +
> +            /* We released the lock above; so re-scan the slots. */
> +            goto restart;
> +        }
> +    }
>
I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted.  I also think that LOG might be
more appropriate than WARNING - as confusing as that is, LOG is more
severe than WARNING (see docs about log_min_messages).


> @@ -0,0 +1,386 @@
> +# Demonstrate that logical decoding can follow timeline switches.
> +#
> +# Test logical decoding on a standby.
> +#
> +use strict;
> +use warnings;
> +use 5.8.0;
> +
> +use PostgresNode;
> +use TestLib;
> +use Test::More tests => 55;
> +use RecursiveCopy;
> +use File::Copy;
> +
> +my ($stdin, $stdout, $stderr, $ret, $handle, $return);
> +my $backup_name;
> +
> +# Initialize master node
> +my $node_master = get_new_node('master');
> +$node_master->init(allows_streaming => 1, has_archiving => 1);
> +$node_master->append_conf('postgresql.conf', q{
> +wal_level = 'logical'
> +max_replication_slots = 4
> +max_wal_senders = 4
> +log_min_messages = 'debug2'
> +log_error_verbosity = verbose
> +# send status rapidly so we promptly advance xmin on master
> +wal_receiver_status_interval = 1
> +# very promptly terminate conflicting backends
> +max_standby_streaming_delay = '2s'
> +});
> +$node_master->dump_info;
> +$node_master->start;
> +
> +$node_master->psql('postgres', q[CREATE DATABASE testdb]);
> +
> +$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
> +$backup_name = 'b1';
> +my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
> +TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'),
> +    '--slot=decoding_standby');
> +
> +sub print_phys_xmin
> +{
> +    my $slot = $node_master->slot('decoding_standby');
> +    return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
> +}
> +
> +my ($xmin, $catalog_xmin) = print_phys_xmin();
> +# After slot creation, xmins must be null
> +is($xmin, '', "xmin null");
> +is($catalog_xmin, '', "catalog_xmin null");
> +
> +my $node_replica = get_new_node('replica');
> +$node_replica->init_from_backup(
> +    $node_master, $backup_name,
> +    has_streaming => 1,
> +    has_restoring => 1);
> +$node_replica->append_conf('postgresql.conf',
> +    q[primary_slot_name = 'decoding_standby']);
> +
> +$node_replica->start;
> +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> +
> +# with hot_standby_feedback off, xmin and catalog_xmin must still be null
> +($xmin, $catalog_xmin) = print_phys_xmin();
> +is($xmin, '', "xmin null after replica join");
> +is($catalog_xmin, '', "catalog_xmin null after replica join");
> +
> +$node_replica->append_conf('postgresql.conf',q[
> +hot_standby_feedback = on
> +]);
> +$node_replica->restart;
> +sleep(2); # ensure walreceiver feedback sent

Can we make this more robust? E.g. by waiting till pg_stat_replication
shows the change on the primary? Because I can guarantee that this'll
fail on slow buildfarm machines (say the valgrind animals).
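
E.g. something like this (untested sketch; it assumes the standby's walsender
shows up in pg_stat_replication with application_name 'replica', which is what
the node name above should give us):

    # Poll instead of sleeping: wait until hot_standby_feedback has actually
    # reached the primary, i.e. backend_xmin is set for the standby's walsender.
    $node_master->poll_query_until('postgres', qq[
        SELECT backend_xmin IS NOT NULL
        FROM pg_stat_replication
        WHERE application_name = 'replica';
    ]) or die "timed out waiting for hot_standby_feedback to reach the primary";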




> +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> +sleep(2); # ensure walreceiver feedback sent

Similar.


Greetings,

Andres Freund


