Обсуждение: [GENERAL] Logical decoding error
Hello,
Running Postgres 9.6.5, we're using logical decoding to take changes to the database and propagate them elsewhere in our system. We are using the PGX Go Postgres library, at https://github.com/jackc/pgx, and we are using the test_decoding plugin to format the changes. We are using 6 slots/have 6 processes streaming the changes from our database.
This setup works great, except that every 20 hours or so, some or all of the processes encounter a problem, all at the same time. They receive an unexpected message type 'w'. At this point the processes restart, and when they do, they encounter another error: "ERROR: got sequence entry 0 for toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (the chunk number/seq number varies). This causes them to restart again. They will encounter the sequence entry error up to 3 more times, before things magically start to work again.
We are also doing standard streaming replication to a slave off this database, and that has never seen a problem.
Does this ring a bell for anyone? Do you have any suggestions for how I should go about figuring out what's happening?
Thanks,
Mark
> On Nov 2, 2017, at 9:34 AM, Mark Fletcher <markf@corp.groups.io> wrote: > > Hello, > > Running Postgres 9.6.5, we're using logical decoding to take changes to the database and propagate them elsewhere in oursystem. We are using the PGX Go Postgres library, at https://github.com/jackc/pgx, and we are using the test_decodingplugin to format the changes. We are using 6 slots/have 6 processes streaming the changes from our database. > > This setup works great, except that every 20 hours or so, some or all of the processes encounter a problem, all at thesame time. They receive an unexpected message type 'w'. At this point the processes restart, and when they do, they encounteranother error: "ERROR: got sequence entry 0 for toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (thechunk number/seq number varies). This causes them to restart again. They will encounter the sequence entry error up to3 more times, before things magically start to work again. > > We are also doing standard streaming replication to a slave off this database, and that has never seen a problem. > > Does this ring a bell for anyone? Do you have any suggestions for how I should go about figuring out what's happening? Where are the errors coming from - your code or pgx? If it's from pgx, what's the exact error? ('w' is regular replicationpayload data, so it'd be expected as a copydata payload message type, but would be an error for a replicationmessage). Do you capture the raw data from the replication connection when the error happens? (If you're using pgx you might be interested in https://github.com/wttw/pgoutput - it's support for the pgoutput logicaldecoder in PG10, which might be a bit more robust to deal with than the test_decoding one). Cheers, Steve -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Thu, Nov 2, 2017 at 9:59 AM, Steve Atkins <steve@blighty.com> wrote:
Where are the errors coming from - your code or pgx? If it's from pgx, what's the exact error? ('w' is regular replication payload data, so it'd be expected as a copydata payload message type, but would be an error for a replication message).
The errors are coming from PGX. Best I can tell (and I'm not very familiar with the PGX code, nor with the internals of Postgres replication), it's waiting for a replication message from the backend, so it's expecting things like CopyData/CopyInResponse/CopyOutResponse/CopyBothResponse. It gets a 'w' msg type there, which it doesn't handle, so it errors back to me (the copydata message type it expects there is 'd').
I'm only using PGX to open a replication connection, start replication, read replication messages and send heartbeats.
I did open an issue on GitHub about this a couple of days ago, but haven't heard back from the PGX developer, so I thought I should try to help out and debug it myself. So I've been trying to educate myself on replication internals and such. Hence my email here.
Do you capture the raw data from the replication connection when the error happens?
I was not, but I have added that and some other logging, and will report back when I get more info.
Thanks,
Mark