Thread: No such file or directory in pg_replslot

No such file or directory in pg_replslot

From
Jeremy Finzel
Date:
I don't know if this applies only to pglogical or logical decoding in general.  This is on a 9.6.10 provider running pglogical 2.2.0.  Subscriber has same versions.  We had a replication delay situation this morning, which I think may have been due to a really long transaction but I've yet to verify that.

I disabled and re-enabled replication and at one point, this created an error on start_replication_slot that the pid was already active.

Somehow replication got wedged and now even though replication appears to be working, strace shows these kinds of errors continually:
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F4000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F5000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F6000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F7000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F8000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F9000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FA000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FB000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FC000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FD000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FE000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)

Any suggestions?  This is a showstopper for us.

Thank you,
Jeremy

Re: No such file or directory in pg_replslot

From
Andres Freund
Date:

On December 8, 2018 9:08:09 AM PST, Jeremy Finzel <finzelj@gmail.com> wrote:
>I don't know if this applies only to pglogical or logical decoding in
>general.  This is on a 9.6.10 provider running pglogical 2.2.0.
>Subscriber
>has same versions.  We had a replication delay situation this morning,
>which I think may have been due to a really long transaction but I've
>yet
>to verify that.
>
>I disabled and re-enabled replication and at one point, this created an
>error on start_replication_slot that the pid was already active.
>
>Somehow replication got wedged and now even though replication appears
>to
>be working, strace shows these kinds of errors continually:
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F4000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F5000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F6000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F7000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F8000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F9000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FA000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FB000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FC000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FD000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FE000000.snap",
>O_RDONLY) = -1 ENOENT (No such file or directory)
>
>Any suggestions?  This is a showstopper for us.

That doesn't indicate an error.  You need to provide more details what made you consider things wedged...

Andres


Re: No such file or directory in pg_replslot

From
Jeremy Finzel
Date:
> That doesn't indicate an error.  You need to provide more details what made you consider things wedged...
>
> Andres

Thank you very much for the reply.  We typically never see a replication delay of more than 5 minutes.  Today we saw a delay of over 3 hours, with no obvious increase in workload on either the provider or the subscriber.  I also did not see the LSN advancing at all in terms of applying changes.

I first checked for long-running transactions on the master but there was nothing too unusual except an ANALYZE which I promptly killed, but with no improvement to the situation.

I found the messages above using strace after canceling the subscription and finding that the process was taking extremely long to cancel.  There are 2.1 million files in pg_replslot which I don't think is normal?  Any ideas as to where I should be looking or what could cause this?

Thanks,
Jeremy


Re: No such file or directory in pg_replslot

From
Jeremy Finzel
Date:


On Sat, Dec 8, 2018 at 1:21 PM Jeremy Finzel <finzelj@gmail.com> wrote:
>> That doesn't indicate an error.  You need to provide more details what made you consider things wedged...
>>
>> Andres
>
> Thank you very much for the reply.  We typically see no visible replication delay over 5 minutes ever.  Today we saw a delay of over 3 hours, and no obvious increase in workload either on the provider or the subscriber.  I also did not see the LSN advancing whatsoever in terms of applying changes.
>
> I first checked for long-running transactions on the master but there was nothing too unusual except an ANALYZE which I promptly killed, but with no improvement to the situation.
>
> I found the messages above using strace after canceling the subscription and finding that the process was taking extremely long to cancel.  There are 2.1 million files in pg_replslot which I don't think is normal?  Any ideas as to where I should be looking or what could cause this?
>
> Thanks,
> Jeremy

I have very good news in that waiting it out for several hours, it resolved itself.  Thank you, your input steered us in the right direction!

Jeremy
 

Re: No such file or directory in pg_replslot

From
Adrien NAYRAT
Date:
On 12/8/18 8:21 PM, Jeremy Finzel wrote:
> There are 2.1 million files in pg_replslot which I don't think is 
> normal?  Any ideas as to where I should be looking or what could cause this?

Postgres spills changes to disk when you have a big transaction:
https://blog.anayrat.info/en/2018/03/10/logical-replication-internals/

You can monitor it with check_pgactivity's replication_slots service:
https://github.com/OPMDG/check_pgactivity/blob/master/check_pgactivity#L5664

(You have to use the master version; this feature has not been released yet.)
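
For a quick look without a monitoring tool, the spill files can also be counted directly. The following is only a rough Python sketch, not part of check_pgactivity or PostgreSQL. It assumes the file naming seen in the strace output earlier in this thread (xid-<xid>-lsn-<hex>-<hex>.snap inside pg_replslot/<slot name>/); the function names parse_spill_name and count_spill_files are made up for this illustration.

```python
import os
import re
from collections import Counter

# Pattern inferred from the file names in the strace output in this thread:
# xid-<decimal xid>-lsn-<hex>-<hex>.snap (an assumption, not a documented API).
SPILL_RE = re.compile(r"^xid-(\d+)-lsn-([0-9A-Fa-f]+)-([0-9A-Fa-f]+)\.snap$")


def parse_spill_name(name):
    """Return (xid, "HI/LO" text) for a spill file name, or None if it does not match."""
    m = SPILL_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), f"{m.group(2).upper()}/{m.group(3).upper()}"


def count_spill_files(replslot_dir):
    """Map each slot directory name to a Counter of {xid: number of spill files}."""
    result = {}
    with os.scandir(replslot_dir) as slots:
        for slot in slots:
            if not slot.is_dir():
                continue
            per_xid = Counter()
            # scandir streams entries, which matters with millions of files.
            with os.scandir(slot.path) as entries:
                for entry in entries:
                    parsed = parse_spill_name(entry.name)
                    if parsed is not None:
                        per_xid[parsed[0]] += 1
            result[slot.name] = per_xid
    return result
```

Calling count_spill_files on the pg_replslot directory of the provider's data directory (run as a user that can read it) returns per-slot counters; a single xid owning most of the files would be consistent with one large transaction being spilled, as described above.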