Обсуждение: [GENERAL] Logical decoding error

Поиск
Список
Период
Сортировка

[GENERAL] Logical decoding error

От
Mark Fletcher
Дата:
Hello,

Running Postgres 9.6.5, we're using logical decoding to take changes to the database and propagate them elsewhere in our system. We are using the PGX Go Postgres library, at https://github.com/jackc/pgx, and we are using the test_decoding plugin to format the changes. We are using 6 slots/have 6 processes streaming the changes from our database.

This setup works great, except that every 20 hours or so, some or all of the processes encounter a problem, all at the same time. They receive an unexpected message type 'w'.  At this point the processes restart, and when they do, they encounter another error: "ERROR: got sequence entry 0 for toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (the chunk number/seq number varies). This causes them to restart again. They will encounter the sequence entry error up to 3 more times, before things magically start to work again.

We are also doing standard streaming replication to a slave off this database, and that has never seen a problem.

Does this ring a bell for anyone? Do you have any suggestions for how I should go about figuring out what's happening?

Thanks,
Mark

Re: [GENERAL] Logical decoding error

От
Steve Atkins
Дата:
> On Nov 2, 2017, at 9:34 AM, Mark Fletcher <markf@corp.groups.io> wrote:
>
> Hello,
>
> Running Postgres 9.6.5, we're using logical decoding to take changes to the database and propagate them elsewhere in
oursystem. We are using the PGX Go Postgres library, at https://github.com/jackc/pgx, and we are using the
test_decodingplugin to format the changes. We are using 6 slots/have 6 processes streaming the changes from our
database.
>
> This setup works great, except that every 20 hours or so, some or all of the processes encounter a problem, all at
thesame time. They receive an unexpected message type 'w'.  At this point the processes restart, and when they do, they
encounteranother error: "ERROR: got sequence entry 0 for toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)"
(thechunk number/seq number varies). This causes them to restart again. They will encounter the sequence entry error up
to3 more times, before things magically start to work again. 
>
> We are also doing standard streaming replication to a slave off this database, and that has never seen a problem.
>
> Does this ring a bell for anyone? Do you have any suggestions for how I should go about figuring out what's
happening?

Where are the errors coming from - your code or pgx? If it's from pgx, what's the exact error? ('w' is regular
replicationpayload data, so it'd be expected as a copydata payload message type, but would be an error for a
replicationmessage). 

Do you capture the raw data from the replication connection when the error happens?

(If you're using pgx you might be interested in https://github.com/wttw/pgoutput - it's support for the pgoutput
logicaldecoder in PG10, which might be a bit more robust to deal with than the test_decoding one). 

Cheers, Steve





--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Logical decoding error

От
Mark Fletcher
Дата:
On Thu, Nov 2, 2017 at 9:59 AM, Steve Atkins <steve@blighty.com> wrote:

Where are the errors coming from - your code or pgx? If it's from pgx, what's the exact error? ('w' is regular replication payload data, so it'd be expected as a copydata payload message type, but would be an error for a replication message).

The errors are coming from PGX. Best I can tell (and I'm not very familiar with the PGX code, nor with the internals of Postgres replication), it's waiting for a replication message from the backend, so it's expecting things like CopyData/CopyInResponse/CopyOutResponse/CopyBothResponse. It gets a 'w' msg type there, which it doesn't handle, so it errors back to me (the copydata message type it expects there is 'd').

I'm only using PGX to open a replication connection, start replication, read replication messages and send heartbeats.

I did open an issue on GitHub about this a couple of days ago, but haven't heard back from the PGX developer, so I thought I should try to help out and debug it myself. So I've been trying to educate myself on replication internals and such. Hence my email here.


Do you capture the raw data from the replication connection when the error happens?

I was not, but I have added that and some other logging, and will report back when I get more info.


Thanks,
Mark