Thread: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

[HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Albe Laurenz
Date:
While playing with HEAD as of d14c85ed,
I ran into the following:

CREATE DATABASE source;
CREATE DATABASE recipient;

\c source
CREATE TABLE repli(id integer PRIMARY KEY, val text NOT NULL);
INSERT INTO repli VALUES (1, 'one');
CREATE PUBLICATION repsend FOR TABLE repli;
SELECT pg_create_logical_replication_slot('reprec', 'pgoutput');

\c recipient
CREATE TABLE repli(id integer PRIMARY KEY, val text NOT NULL);
CREATE SUBSCRIPTION reprec CONNECTION 'port=5432 dbname=source user=postgres' PUBLICATION repsend
    WITH (create_slot = 'false', slot_name = 'reprec');

SELECT * FROM repli;
┌────┬─────┐
│ id │ val │
├────┼─────┤
│  1 │ one │
└────┴─────┘
(1 row)

Looks good.
Now I try to produce a replication conflict:

INSERT INTO repli VALUES (2, 'block');

\c source
INSERT INTO repli VALUES (2, 'two');

After a while the server crashes, and the log looks like this:

2017-06-21 14:55:07.002 CEST [23096] ERROR:  duplicate key value violates unique constraint "repli_pkey"
2017-06-21 14:55:07.002 CEST [23096] DETAIL:  Key (id)=(2) already exists.
2017-06-21 14:55:07.003 CEST [23070] LOG:  worker process: logical replication worker for subscription 28681 (PID 23096) exited with exit code 1
2017-06-21 14:55:07.006 CEST [23117] LOG:  logical replication apply worker for subscription "reprec" has started
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
2017-06-21 14:55:07.008 CEST [23118] LOG:  starting logical decoding for slot "reprec"
2017-06-21 14:55:07.008 CEST [23118] DETAIL:  streaming transactions committing after 0/51A67B8, reading WAL from 0/51A66C8
LOG:  starting logical decoding for slot "reprec"
DETAIL:  streaming transactions committing after 0/51A67B8, reading WAL from 0/51A66C8
2017-06-21 14:55:07.008 CEST [23118] LOG:  logical decoding found consistent point at 0/51A66C8
2017-06-21 14:55:07.008 CEST [23118] DETAIL:  There are no running transactions.
DEBUG:  sending replication keepalive
LOG:  logical decoding found consistent point at 0/51A66C8
DETAIL:  There are no running transactions.
2017-06-21 14:55:07.009 CEST [23117] ERROR:  duplicate key value violates unique constraint "repli_pkey"
2017-06-21 14:55:07.009 CEST [23117] DETAIL:  Key (id)=(2) already exists.
2017-06-21 14:55:07.010 CEST [23070] LOG:  worker process: logical replication worker for subscription 28681 (PID 23117) exited with exit code 1
2017-06-21 14:55:12.019 CEST [23122] LOG:  logical replication apply worker for subscription "reprec" has started
DEBUG:  received replication command: IDENTIFY_SYSTEM
2017-06-21 14:55:12.021 CEST [23124] LOG:  starting logical decoding for slot "reprec"
2017-06-21 14:55:12.021 CEST [23124] DETAIL:  streaming transactions committing after 0/51A67B8, reading WAL from 0/51A66C8
DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
2017-06-21 14:55:12.022 CEST [23124] LOG:  logical decoding found consistent point at 0/51A66C8
2017-06-21 14:55:12.022 CEST [23124] DETAIL:  There are no running transactions.
LOG:  starting logical decoding for slot "reprec"
DETAIL:  streaming transactions committing after 0/51A67B8, reading WAL from 0/51A66C8
DEBUG:  sending replication keepalive
LOG:  logical decoding found consistent point at 0/51A66C8
DETAIL:  There are no running transactions.
2017-06-21 14:55:12.023 CEST [23122] ERROR:  duplicate key value violates unique constraint "repli_pkey"
2017-06-21 14:55:12.023 CEST [23122] DETAIL:  Key (id)=(2) already exists.
2017-06-21 14:55:12.024 CEST [23070] LOG:  worker process: logical replication worker for subscription 28681 (PID 23122) exited with exit code 1
2017-06-21 14:55:12.033 CEST [23124] LOG:  could not send data to client: Broken pipe
2017-06-21 14:55:12.033 CEST [23124] FATAL:  connection to client lost
2017-06-21 14:55:17.032 CEST [23133] LOG:  logical replication apply worker for subscription "reprec" has started
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
2017-06-21 14:57:24.552 CEST [23124] PANIC:  stuck spinlock detected at ReplicationSlotRelease, slot.c:394
2017-06-21 14:57:24.885 CEST [23070] LOG:  server process (PID 23124) was terminated by signal 6: Aborted
2017-06-21 14:57:24.885 CEST [23070] LOG:  terminating any other active server processes
2017-06-21 14:57:24.887 CEST [23134] LOG:  could not send data to client: Broken pipe
2017-06-21 14:57:24.890 CEST [23070] LOG:  all server processes terminated; reinitializing

Yours,
Laurenz Albe

Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Peter Eisentraut
Date:
On 6/21/17 09:02, Albe Laurenz wrote:
> 2017-06-21 14:55:12.033 CEST [23124] LOG:  could not send data to client: Broken pipe
> 2017-06-21 14:55:12.033 CEST [23124] FATAL:  connection to client lost
> 2017-06-21 14:55:17.032 CEST [23133] LOG:  logical replication apply worker for subscription "reprec" has started
> DEBUG:  received replication command: IDENTIFY_SYSTEM
> DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
> 2017-06-21 14:57:24.552 CEST [23124] PANIC:  stuck spinlock detected at ReplicationSlotRelease, slot.c:394
> 2017-06-21 14:57:24.885 CEST [23070] LOG:  server process (PID 23124) was terminated by signal 6: Aborted
> 2017-06-21 14:57:24.885 CEST [23070] LOG:  terminating any other active server processes
> 2017-06-21 14:57:24.887 CEST [23134] LOG:  could not send data to client: Broken pipe
> 2017-06-21 14:57:24.890 CEST [23070] LOG:  all server processes terminated; reinitializing

I can't reproduce that.  I let it loop around for about 10 minutes and
it was fine.

I notice that you have some debug settings on.  Could you share your
exact setup steps from initdb, as well as configure options, just in
case one of these settings is causing a problem?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Albe Laurenz
Date:
Peter Eisentraut wrote:
> On 6/21/17 09:02, Albe Laurenz wrote:
>> 2017-06-21 14:55:12.033 CEST [23124] LOG:  could not send data to client: Broken pipe
>> 2017-06-21 14:55:12.033 CEST [23124] FATAL:  connection to client lost
>> 2017-06-21 14:55:17.032 CEST [23133] LOG:  logical replication apply worker for subscription "reprec" has started
>> DEBUG:  received replication command: IDENTIFY_SYSTEM
>> DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
>> 2017-06-21 14:57:24.552 CEST [23124] PANIC:  stuck spinlock detected at ReplicationSlotRelease, slot.c:394
>> 2017-06-21 14:57:24.885 CEST [23070] LOG:  server process (PID 23124) was terminated by signal 6: Aborted
>> 2017-06-21 14:57:24.885 CEST [23070] LOG:  terminating any other active server processes
>> 2017-06-21 14:57:24.887 CEST [23134] LOG:  could not send data to client: Broken pipe
>> 2017-06-21 14:57:24.890 CEST [23070] LOG:  all server processes terminated; reinitializing
> 
> I can't reproduce that.  I let it loop around for about 10 minutes and
> it was fine.
> 
> I notice that you have some debug settings on.  Could you share your
> exact setup steps from initdb, as well as configure options, just in
> case one of these settings is causing a problem?

System is Linux, RHEL 6.9, Kernel 2.6.32-696.1.1.el6.x86_64.

Configure options:

CFLAGS='-Wall -O0 -g' ./configure \
        --prefix=/home/laurenz/pg10 \
        --sysconfdir=/magwien/etc \
        --docdir=/home/laurenz/pg10/doc \
        --mandir=/home/laurenz/pg10/man \
        --with-perl \
        --with-tcl \
        --with-tclconfig=/usr/lib64 \
        --with-python \
        --with-openssl \
        --with-pam \
        --with-gssapi \
        --enable-nls=de \
        --enable-thread-safety \
        --with-libxml \
        --with-libxslt \
        --enable-debug \
        --enable-cassert \
        --with-ldap


$ initdb -A trust -E UTF8 --locale=de_DE.UTF-8 --lc-messages=C -T german -U postgres -W -k dbhome

$ grep -v '^[[:space:]]*#' postgresql.conf | grep -v '^$' | sed -e 's/#.*$//'

port = 5432
max_connections = 100
password_encryption = scram-sha-256
shared_buffers = 128MB
dynamic_shared_memory_type = posix
wal_level = logical
log_destination = 'stderr'
logging_collector = on
log_filename = 'postgresql-%Y-%m-%d.log'
log_rotation_size = 0
client_min_messages = debug2
log_min_messages = error
log_connections = off
log_disconnections = off
log_timezone = 'Europe/Vienna'
datestyle = 'iso, dmy'
timezone = 'Europe/Vienna'
lc_messages = 'C'
lc_monetary = 'de_DE.UTF-8'
lc_numeric = 'de_DE.UTF-8'
lc_time = 'de_DE.UTF-8'
default_text_search_config = 'pg_catalog.german'

The crash happens reliably here.

Yours,
Laurenz Albe

Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Alvaro Herrera
Date:
Albe Laurenz wrote:
> Peter Eisentraut wrote:
> > On 6/21/17 09:02, Albe Laurenz wrote:
> >> 2017-06-21 14:55:12.033 CEST [23124] LOG:  could not send data to client: Broken pipe
> >> 2017-06-21 14:55:12.033 CEST [23124] FATAL:  connection to client lost
> >> 2017-06-21 14:55:17.032 CEST [23133] LOG:  logical replication apply worker for subscription "reprec" has started
> >> DEBUG:  received replication command: IDENTIFY_SYSTEM
> >> DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
> >> 2017-06-21 14:57:24.552 CEST [23124] PANIC:  stuck spinlock detected at ReplicationSlotRelease, slot.c:394
> >> 2017-06-21 14:57:24.885 CEST [23070] LOG:  server process (PID 23124) was terminated by signal 6: Aborted
> >> 2017-06-21 14:57:24.885 CEST [23070] LOG:  terminating any other active server processes
> >> 2017-06-21 14:57:24.887 CEST [23134] LOG:  could not send data to client: Broken pipe
> >> 2017-06-21 14:57:24.890 CEST [23070] LOG:  all server processes terminated; reinitializing
> > 
> > I can't reproduce that.  I let it loop around for about 10 minutes and
> > it was fine.
> > 
> > I notice that you have some debug settings on.  Could you share your
> > exact setup steps from initdb, as well as configure options, just in
> > case one of these settings is causing a problem?

Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
some elog(DEBUG1) calls with the slot spinlock held.  That's probably
uncool.
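
For readers following along, here is a minimal sketch of the shape being described: an elog(DEBUG1) issued while the slot spinlock is held.  The function name and the exact fields touched are made up for illustration; this is not the literal logical.c code.

#include "postgres.h"

#include "replication/slot.h"
#include "storage/spin.h"

/* Illustrative sketch only -- not the actual LogicalIncreaseRestartDecodingForSlot() */
static void
sketch_update_candidate_restart(ReplicationSlot *slot,
                                XLogRecPtr current_lsn,
                                XLogRecPtr restart_lsn)
{
    SpinLockAcquire(&slot->mutex);

    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;

    /*
     * elog() while the spinlock is held: with client_min_messages at DEBUG1
     * or DEBUG2 this tries to send the message to the client, and a dead
     * connection turns that into ereport(FATAL), so the SpinLockRelease()
     * below is never reached.
     */
    elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
         (uint32) (restart_lsn >> 32), (uint32) restart_lsn,
         (uint32) (current_lsn >> 32), (uint32) current_lsn);

    SpinLockRelease(&slot->mutex);
}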

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Peter Eisentraut
Date:
On 6/23/17 13:26, Alvaro Herrera wrote:
> Albe Laurenz wrote:
>> Peter Eisentraut wrote:
>>> On 6/21/17 09:02, Albe Laurenz wrote:
>>>> 2017-06-21 14:55:12.033 CEST [23124] LOG:  could not send data to client: Broken pipe
>>>> 2017-06-21 14:55:12.033 CEST [23124] FATAL:  connection to client lost
>>>> 2017-06-21 14:55:17.032 CEST [23133] LOG:  logical replication apply worker for subscription "reprec" has started
>>>> DEBUG:  received replication command: IDENTIFY_SYSTEM
>>>> DEBUG:  received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"')
>>>> 2017-06-21 14:57:24.552 CEST [23124] PANIC:  stuck spinlock detected at ReplicationSlotRelease, slot.c:394
>>>> 2017-06-21 14:57:24.885 CEST [23070] LOG:  server process (PID 23124) was terminated by signal 6: Aborted
>>>> 2017-06-21 14:57:24.885 CEST [23070] LOG:  terminating any other active server processes
>>>> 2017-06-21 14:57:24.887 CEST [23134] LOG:  could not send data to client: Broken pipe
>>>> 2017-06-21 14:57:24.890 CEST [23070] LOG:  all server processes terminated; reinitializing
>>>
>>> I can't reproduce that.  I let it loop around for about 10 minutes and
>>> it was fine.
>>>
>>> I notice that you have some debug settings on.  Could you share your
>>> exact setup steps from initdb, as well as configure options, just in
>>> case one of these settings is causing a problem?
> 
> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
> uncool.

I can reproduce the issue with client_min_messages set to debug2 or debug1,
but it doesn't appear with the default settings.  I don't always get the
"stuck spinlock" message, but it hangs badly pretty reliably after two or
three rounds of erroring.

Removing the call you pointed out doesn't make a difference, but it's
possibly something similar.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Andres Freund
Date:
On 2017-06-23 13:26:58 -0400, Alvaro Herrera wrote:
> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
> uncool.

It definitely is not cool, rather daft even (it's probably me who wrote
that).  No idea why I've done it like that, makes no sense.

Andres Freund



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 6/23/17 13:26, Alvaro Herrera wrote:
>> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
>> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
>> uncool.

> I can reproduce the issue with --client-min-messages=debug2 or debug1,
> but it doesn't appear with the default settings.  I don't always get the
> "stuck spinlock" message, but it hangs badly pretty reliably after two
> or three rounds of erroring.

> Removing the call you pointed out doesn't make a difference, but it's
> possibly something similar.

Maybe some function invoked in a debug ereport accesses shared memory
or takes locks (eg, for catalog access)?
        regards, tom lane



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Peter Eisentraut
Date:
On 6/23/17 16:10, Peter Eisentraut wrote:
>> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
>> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
>> uncool.

> Removing the call you pointed out doesn't make a difference, but it's
> possibly something similar.

Correction: The fault actually is in that function.  I had only looked
at the "if" branch, but it was the "else" branch that triggered the
problem in this run.  If I remove those elog calls, everything works
smoothly.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Peter Eisentraut
Date:
On 6/23/17 16:15, Andres Freund wrote:
> On 2017-06-23 13:26:58 -0400, Alvaro Herrera wrote:
>> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
>> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
>> uncool.
> 
> It definitely is not cool, rather daft even (it's probably me who wrote
> that).  No idea why I've done it like that, makes no sense.

Do you want to take a look at moving those elog calls around a bit?  That
should do it.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Do you want to take a look at moving those elog calls around a bit?  That
> should do it.

It would be a good idea to have some clarity on *why* that should do it.

Looking at the original report's log, but without having actually
reproduced the problem, I guess what is happening is this:

1. Subscription worker process (23117) gets a duplicate key conflict while
trying to apply an update, and in consequence it exits.  (Is that supposed
to happen?)

2. Publication server process (23124) doesn't notice client connection
loss right away.  By chance, the next thing it tries to send to the client
is the debug output from LogicalIncreaseRestartDecodingForSlot.  Then it
detects loss of connection (at 2017-06-21 14:55:12.033) and FATAL's out.
But since the spinlock stuff has no tracking infrastructure, we don't
know we are still holding the replication slot mutex.

3. Process exit cleanup does know that it's supposed to release the
replication slot, so it tries to take the mutex spinlock ... again.
Eventually that times out and we get the "stuck spinlock" panic.

All correct so far?
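
A toy way to see step 3, assuming the slot mutex really is still held when the exit cleanup runs (toy_stuck_spinlock is a made-up example, not the actual exit path):

#include "postgres.h"

#include "storage/spin.h"

static slock_t toy_lock;

/*
 * Acquiring a spinlock this process already holds makes s_lock() spin,
 * back off, and finally PANIC with "stuck spinlock detected" -- roughly
 * what ReplicationSlotRelease() runs into during process exit here.
 */
static void
toy_stuck_spinlock(void)
{
    SpinLockInit(&toy_lock);

    SpinLockAcquire(&toy_lock);     /* held across the failed elog() */
    SpinLockAcquire(&toy_lock);     /* exit-time cleanup: spins for about two minutes, then PANICs */

    SpinLockRelease(&toy_lock);     /* never reached */
}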

So, okay, the proximate cause of the crash is a blatant violation of the
rule that spinlocks may only be held across straight-line code segments.
But I'm wondering about the client exit having occurred in the first
place.  Why is that, and how would one ever recover?  It sure looks
like this isn't the first subscription worker process that has tried
and failed to apply the update.  If our attitude towards this situation is
that it's okay to fork-bomb your server with worker processes continually
respawning and making no progress, well, I don't think that's good enough.
        regards, tom lane



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Petr Jelinek
Date:
On 24/06/17 04:50, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Do you want to take a look at moving those elog calls around a bit?  That
>> should do it.
> 
> It would be a good idea to have some clarity on *why* that should do it.
> 
> Looking at the original report's log, but without having actually
> reproduced the problem, I guess what is happening is this:
> 
> 1. Subscription worker process (23117) gets a duplicate key conflict while
> trying to apply an update, and in consequence it exits.  (Is that supposed
> to happen?)
> 
> 2. Publication server process (23124) doesn't notice client connection
> loss right away.  By chance, the next thing it tries to send to the client
> is the debug output from LogicalIncreaseRestartDecodingForSlot.  Then it
> detects loss of connection (at 2017-06-21 14:55:12.033) and FATAL's out.
> But since the spinlock stuff has no tracking infrastructure, we don't
> know we are still holding the replication slot mutex.
> 
> 3. Process exit cleanup does know that it's supposed to release the
> replication slot, so it tries to take the mutex spinlock ... again.
> Eventually that times out and we get the "stuck spinlock" panic.
> 
> All correct so far?

Sounds about right.

> So, okay, the proximate cause of the crash is a blatant violation of the
> rule that spinlocks may only be held across straight-line code segments.
> But I'm wondering about the client exit having occurred in the first
> place.  Why is that, and how would one ever recover?  It sure looks
> like this isn't the first subscription worker process that has tried
> and failed to apply the update.  If our attitude towards this situation is
> that it's okay to fork-bomb your server with worker processes continually
> respawning and making no progress, well, I don't think that's good enough.
> 

Well, we don't have conflict detection/handling in PG10 the way, for example,
pglogical does.  Even once we have that, it probably won't be able to resolve
multiple unique index violations (there is no obvious way to do that
automatically).  And we can't really make progress while there is an
unresolved constraint violation.  To recover, one has to either remove the
conflicting row on the downstream or remove the transaction from replication
on the upstream by manually consuming it using
pg_logical_slot_get_binary_changes.  That's arguably a somewhat ugly interface
for this; we might want to invent a nicer interface, even for PG10, but that
would mean a catalog bump, so it would have to happen rather soon if we go
there.

As for fork-bombing, it should be a very slow fork bomb (we rate-limit the
worker starting), but I agree it's not an ideal situation.  I am open to
suggestions about what we can do there.  If we had some kind of list of
non-recoverable errors, we could automatically disable the subscription on
them (although we would need to be able to modify the catalog for that, which
may not be possible after an unrecoverable error), but it's not clear to me
how to reasonably produce such a list.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

From:
Peter Eisentraut
Date:
On 6/23/17 20:58, Peter Eisentraut wrote:
> On 6/23/17 16:15, Andres Freund wrote:
>> On 2017-06-23 13:26:58 -0400, Alvaro Herrera wrote:
>>> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
>>> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
>>> uncool.
>>
>> It definitely is not cool, rather daft even (it's probably me who wrote
>> that).  No idea why I've done it like that, makes no sense.
> 
> Do you want to take a look at moving those elog calls around a bit?  That
> should do it.

This hasn't been fixed yet.  How should we do it?  Just move the spinlock
release to before the elog() calls, thus risking reading some data
unprotected, or copy the necessary values into local variables first?
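
For concreteness, a rough sketch of the second option (copy into locals while holding the spinlock, log afterwards) -- purely illustrative, with a made-up function name, not a proposed patch:

#include "postgres.h"

#include "replication/slot.h"
#include "storage/spin.h"

static void
sketch_update_and_log(ReplicationSlot *slot,
                      XLogRecPtr current_lsn,
                      XLogRecPtr restart_lsn)
{
    XLogRecPtr  copied_current;
    XLogRecPtr  copied_restart;

    SpinLockAcquire(&slot->mutex);
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;
    copied_current = slot->candidate_restart_valid;
    copied_restart = slot->candidate_restart_lsn;
    SpinLockRelease(&slot->mutex);

    /* safe: no client I/O can happen while the spinlock is held */
    elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
         (uint32) (copied_restart >> 32), (uint32) copied_restart,
         (uint32) (copied_current >> 32), (uint32) copied_current);
}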

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services