Streaming replication - invalid resource manager ID

From: "Maeldron T."
Date:
I’m a bit stuck with a few LOG messages. The one in the subject disturbs me the most.

(FreeBSD 11.2, PostgreSQL 10.6)

When I restart the hot standby, sometimes I see

2019-02-13 21:17:31 CET LOG:  redo starts at 0/488A2748
2019-02-13 21:17:31 CET LOG:  invalid record length at 0/488A2780: wanted 24, got 0
2019-02-13 21:17:31 CET LOG:  database system is ready to accept read only connections
2019-02-13 21:17:31 CET LOG:  started streaming WAL from primary at 0/48000000 on timeline 1

I try to believe it is harmless. I found two emails in the archive saying that this is harmless, but they were from many years ago. Is it really harmless (without a PANIC, etc.)? The main reason I believe it’s harmless is that I see it often.

The one I see less often is: "invalid resource manager ID"

The last time I encountered it was when I made the standby the new master, and the original master the new standby. Steps:

1. Stop the master (service postgresql stop)

Log from the original master:
2019-02-13 20:19:12 CET  LOG:  received fast shutdown request
2019-02-13 20:19:12 CET  LOG:  aborting any active transactions
2019-02-13 20:19:12 CET  LOG:  worker process: logical replication launcher (PID 17828) exited with exit code 1
2019-02-13 20:19:12 CET  LOG:  shutting down
2019-02-13 20:19:16 CET  LOG:  database system is shut down

Log from the new master (it was the standby):
2019-02-13 20:19:16 CET  LOG:  replication terminated by primary server
2019-02-13 20:19:16 CET  DETAIL:  End of WAL reached on timeline 3 at 8F42/645FBD00.
2019-02-13 20:19:16 CET  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2019-02-13 20:19:16 CET  LOG:  invalid record length at 8F42/645FBD00: wanted 24, got 0
2019-02-13 20:19:16 CET  FATAL:  could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

2. pg_ctl -D the_whole_galaxy promote

Log from the new master (was slave):
2019-02-13 20:19:31 CET  LOG:  received promote request
2019-02-13 20:19:31 CET  FATAL:  terminating walreceiver process due to administrator command
2019-02-13 20:19:31 CET  LOG:  redo done at 8F42/645FBC90
2019-02-13 20:19:31 CET  LOG:  last completed transaction was at log time 2019-02-13 20:19:12.354569+01
2019-02-13 20:19:31 CET  LOG:  selected new timeline ID: 4
2019-02-13 20:19:31 CET  LOG:  archive recovery complete
2019-02-13 20:19:31 CET  LOG:  database system is ready to accept connections

3. Start the old master after putting a recovery.conf there to make it the new standby

Log from the original master:
2019-02-13 20:23:22 CET  LOG:  database system was shut down at 2019-02-13 20:19:16 CET
2019-02-13 20:23:22 CET  LOG:  entering standby mode
2019-02-13 20:23:22 CET  LOG:  consistent recovery state reached at 8F42/645FBD00
2019-02-13 20:23:22 CET  LOG:  invalid record length at 8F42/645FBD00: wanted 24, got 0
2019-02-13 20:23:22 CET  LOG:  database system is ready to accept read only connections
2019-02-13 20:23:22 CET  LOG:  fetching timeline history file for timeline 4 from primary server
2019-02-13 20:23:22 CET  LOG:  started streaming WAL from primary at 8F42/64000000 on timeline 3
2019-02-13 20:23:22 CET  LOG:  replication terminated by primary server
2019-02-13 20:23:22 CET  DETAIL:  End of WAL reached on timeline 3 at 8F42/645FBD00.
2019-02-13 20:23:22 CET  LOG:  new target timeline is 4
2019-02-13 20:23:22 CET  LOG:  restarted WAL streaming at 8F42/64000000 on timeline 4
2019-02-13 20:23:22 CET  LOG:  redo starts at 8F42/645FBD00

So far so good (except for the invalid record length, but it’s only a LOG).

4. A bit later I shut down the original master server itself (which runs the now-standby PG cluster) due to VM configuration. It takes a few minutes. Then I start the server.

Log from the original master:
2019-02-13 20:33:39 CET  LOG:  received fast shutdown request
2019-02-13 20:33:39 CET  LOG:  aborting any active transactions
2019-02-13 20:33:39 CET  FATAL:  terminating walreceiver process due to administrator command
2019-02-13 20:33:39 CET  LOG:  shutting down
2019-02-13 20:33:40 CET  LOG:  database system is shut down
2019-02-13 20:38:45 CET  LOG:  database system was shut down in recovery at 2019-02-13 20:33:39 CET
2019-02-13 20:38:45 CET  LOG:  entering standby mode
2019-02-13 20:38:45 CET  LOG:  redo starts at 8F42/E7DC51B8
2019-02-13 20:38:47 CET  WARNING:  could not open directory "base/26083437": No such file or directory
2019-02-13 20:38:47 CET  CONTEXT:  WAL redo at 8F42/EDE3E720 for Database/DROP: dir 26083437/1663
2019-02-13 20:38:47 CET  WARNING:  some useless files may be left behind in old database directory "base/26083437"
2019-02-13 20:38:47 CET  CONTEXT:  WAL redo at 8F42/EDE3E720 for Database/DROP: dir 26083437/1663

I don’t like these because they are WARNINGS. I believe they might be there because sometimes I check the replication with a createdb something (on master); psql -l (on standby); dropdb something (on master). Although I do it when they both are running.

2019-02-13 20:39:06 CET  LOG:  consistent recovery state reached at 8F43/38B80FB0
2019-02-13 20:39:06 CET  LOG:  database system is ready to accept read only connections
2019-02-13 20:39:06 CET  LOG:  started streaming WAL from primary at 8F43/38000000 on timeline 4

Except for the let’s-hope-not-harmful WARNING, it looks okay. There were no new log entries on the new master.

5. A bit later I restart the new master cluster as I set things up.

Log from the new master (was slave):
2019-02-13 20:42:04 CET  LOG:  received fast shutdown request
2019-02-13 20:42:04 CET  LOG:  aborting any active transactions
2019-02-13 20:42:04 CET  FATAL:  terminating autovacuum process due to administrator command
2019-02-13 20:42:04 CET  LOG:  worker process: logical replication launcher (PID 2455) exited with exit code 1
2019-02-13 20:42:05 CET  LOG:  shutting down
2019-02-13 20:42:08 CET  LOG:  database system is shut down
2019-02-13 20:42:08 CET  LOG:  database system was shut down at 2019-02-13 20:42:08 CET
2019-02-13 20:42:08 CET  LOG:  database system is ready to accept connections

Here it comes:

Log from the original master:
2019-02-13 20:42:08 CET  LOG:  replication terminated by primary server
2019-02-13 20:42:08 CET  DETAIL:  End of WAL reached on timeline 4 at 8F43/AB4E6438.
2019-02-13 20:42:08 CET  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2019-02-13 20:42:08 CET  LOG:  invalid resource manager ID 118 at 8F43/AB4E6438
2019-02-13 20:42:08 CET  FATAL:  could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-02-13 20:42:13 CET  LOG:  started streaming WAL from primary at 8F43/AB000000 on timeline 4

Although it’s also only a LOG, I find it scary, because when I googled for it, I found it only in terrible contexts, usually followed by a PANIC.

For example: 

Also, I could reproduce the "invalid record length at ..." messages on a local machine (vacuum + update + restarting the master and the standby), but I could not reproduce the "invalid resource manager ID" message no matter how hard I tried.

(Although the local machine has PostgreSQL 11.1)

Besides, it doesn’t go away. Later, I restarted the master one more time.

Log from the new master (was slave):
2019-02-13 21:35:55 CET  LOG:  database system was shut down at 2019-02-13 21:35:54 CET
2019-02-13 21:35:55 CET  LOG:  database system is ready to accept connections

Log from the original master:
2019-02-13 21:35:54 CET  LOG:  replication terminated by primary server
2019-02-13 21:35:54 CET  DETAIL:  End of WAL reached on timeline 4 at 8F46/93705AF0.
2019-02-13 21:35:54 CET  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2019-02-13 21:35:54 CET  LOG:  invalid resource manager ID 31 at 8F46/93705AF0
2019-02-13 21:35:54 CET  FATAL:  could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-02-13 21:35:59 CET  LOG:  started streaming WAL from primary at 8F46/93000000 on timeline 4


Since I’m not someone who says: "It’s okay, it works", I ended up running pg_basebackups after each promote due to these messages. Which means I deleted the data directory on the old master after each promote. And the fact that the basebackup makes these log messages disappear tells me that it might have been the right choice. I mean, if there is no problem at all, why do I see these messages after I connect the old master to the new master, and why aren’t they there after a basebackup? What’s the difference?

There are two master-standby pairs I am upgrading now. I had to re-create the disks in each machine (for other reasons). Since I wanted the websites to stay online, I followed the upgrade the standby => promote the standby => upgrade the original master => promote the original master route. I had maybe 3 seconds of downtime, twice.

I believe I could have done it with zero pg_basebackups, but I have already run it four or five times because of these messages, and I am not finished yet. During the long basebackup I feel less safe, and my apps can’t read from the standby.

I have been using PostgreSQL for ages. I use and love almost every feature of it.

The log messages, on the other hand, are more esoteric to me than the original Maya calendar that sank with Atlantis. Maybe I’m not the only user who, without reading the source code, would expect something like this:

LOG:  Everything is all right. I did what you asked, and I talk about it.
NOTICE: Please fix your pg_hba.conf
WARNING: If I weren’t rock solid ACID, you would be in trouble now. Don’t do this again.
ERROR: The query was wrong but your data is safe.
FATAL: Do you have a backup?
PANIC: Do you have legal insurance?

But the reality in my understanding is something like this:

LOG: Your data is either corrupted or not.
WARNING: I think something is wrong. (But it’s not)
ERROR: Your data is either corrupted or not.
FATAL: I’m the least harmful kind of log message in PostgreSQL. Just give me two seconds. Your data is safe.


I can see what the different kind of log messages communicate: https://www.postgresql.org/docs/11/runtime-config-logging.html

However, when I look at the log, 100 times out of 100 I am more interested in whether my data is safe than in whether only the current command was aborted or the whole session.

If someone would please explain what the invalid resource manager ID message means, they will have my eternal thanks.

M.

Re: Streaming replication - invalid resource manager ID

From: Laurenz Albe
Date:
Maeldron T. wrote:
> I’m a bit stuck with a few LOG messages. The one in the subject disturbs me the most.
> 
> (FreeBSD 11.2, PostgreSQL 10.6)
> 
> When I restart the hot standby, sometimes I see
> 
> 2019-02-13 21:17:31 CET LOG:  redo starts at 0/488A2748
> 2019-02-13 21:17:31 CET LOG:  invalid record length at 0/488A2780: wanted 24, got 0
> 2019-02-13 21:17:31 CET LOG:  database system is ready to accept read only connections
> 2019-02-13 21:17:31 CET LOG:  started streaming WAL from primary at 0/48000000 on timeline 1
> 
> I try to believe it is harmless. I found two emails in the archive saying that this is harmless,
> but they were from many years ago. Is it really harmless (without a PANIC, etc.)?
> The main reason I believe it’s harmless is that I see it often.

It is harmless.

When WAL is replayed from a file, at some point you hit the end of the WAL.
WAL files have a fixed, preallocated length, so you notice that you have
reached the end when you try to read the next WAL record and find only
garbage, or the checksum is not correct.

This is a typical "I found garbage, so I guess it's the end of WAL".
PostgreSQL just explains why it considers the WAL record garbage, so debugging
is easier.

If the database logs a LOG, it thinks that nothing is wrong.
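
To make the "wanted 24, got 0" wording concrete: every WAL record starts with a fixed header whose first field is the record's total length, and a valid length can never be smaller than the header itself (24 bytes on common builds). Reading into a zero-filled part of a segment therefore typically yields a length of 0, which the reader takes as "no further records here". A rough, self-contained sketch of that first sanity check (simplified, not the actual PostgreSQL code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Simplified stand-in for PostgreSQL's XLogRecord header. */
    typedef struct
    {
        uint32_t xl_tot_len;    /* total length of the record, header included */
        uint32_t xl_xid;        /* transaction ID */
        uint64_t xl_prev;       /* pointer to the previous record */
        uint8_t  xl_info;       /* flag bits */
        uint8_t  xl_rmid;       /* resource manager ID (checked separately) */
        uint8_t  xl_pad[2];     /* padding */
        uint32_t xl_crc;        /* checksum over the record */
    } FakeXLogRecord;           /* 24 bytes on typical platforms */

    /* First sanity check: is the claimed length big enough to hold a header? */
    static int
    record_length_is_plausible(const FakeXLogRecord *rec)
    {
        if (rec->xl_tot_len < (uint32_t) sizeof(FakeXLogRecord))
        {
            printf("invalid record length: wanted %u, got %u\n",
                   (unsigned) sizeof(FakeXLogRecord), (unsigned) rec->xl_tot_len);
            return 0;   /* taken as "end of WAL", not reported as an error */
        }
        return 1;
    }

    int
    main(void)
    {
        FakeXLogRecord next;

        /* Pretend the next "record" was read from a zero-filled part of a segment. */
        memset(&next, 0, sizeof(next));
        record_length_is_plausible(&next);  /* prints: wanted 24, got 0 */
        return 0;
    }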

> The one I see less often is: "invalid resource manager ID"
> 
> The last time I encountered it was when I made the standby the new master, and the original master the new standby.
> Steps:

That is the same thing, except that the next WAL record header passed the
first validity check (the length was big enough) and failed the second.

It means that the number found in the "resource manager" field of the header
was bigger than the allowed maximum.

See the function "ValidXLogRecordHeader" in "src/backend/access/transam/xlogreader.c".
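
A similarly rough sketch of that second check (again simplified; the real code is in "ValidXLogRecordHeader"): the header names the resource manager whose redo routine should replay the record, and the valid IDs form a small contiguous range starting at 0, so leftover bytes (WAL segments are recycled, not always zero-filled) easily produce an out-of-range number like the 118 or 31 in your logs:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Hypothetical stand-in for PostgreSQL's RM_MAX_ID, the highest valid
     * resource manager ID. The real value is small (the built-in resource
     * managers are numbered from 0 upwards) and depends on the server version.
     */
    #define FAKE_RM_MAX_ID 21

    /* Second sanity check: does the header name a resource manager that exists? */
    static int
    rmid_is_plausible(uint8_t xl_rmid)
    {
        if (xl_rmid > FAKE_RM_MAX_ID)
        {
            printf("invalid resource manager ID %u\n", (unsigned) xl_rmid);
            return 0;   /* again taken as "end of WAL" */
        }
        return 1;
    }

    int
    main(void)
    {
        rmid_is_plausible(118);     /* prints: invalid resource manager ID 118 */
        rmid_is_plausible(31);      /* prints: invalid resource manager ID 31 */
        return 0;
    }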

> 2019-02-13 20:38:45 CET  LOG:  entering standby mode
> 2019-02-13 20:38:45 CET  LOG:  redo starts at 8F42/E7DC51B8
> 2019-02-13 20:38:47 CET  WARNING:  could not open directory "base/26083437": No such file or directory
> 2019-02-13 20:38:47 CET  CONTEXT:  WAL redo at 8F42/EDE3E720 for Database/DROP: dir 26083437/1663
> 2019-02-13 20:38:47 CET  WARNING:  some useless files may be left behind in old database directory "base/26083437"
> 2019-02-13 20:38:47 CET  CONTEXT:  WAL redo at 8F42/EDE3E720 for Database/DROP: dir 26083437/1663
> 
> I don’t like these because they are WARNINGS. I believe they might be there because sometimes
> I check the replication with a createdb something (on master); psql -l (on standby);
> dropdb something (on master). Although I do it when they both are running.

Wow, that's an ... interesting method to check if replication is working.
What about something lighter, like running an UPDATE on a test table?
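
For example, something along these lines would do, assuming a one-row test table created once beforehand; the table name "replication_check" and the connection strings below are placeholders I am making up, and the program is only a rough libpq sketch:

    /*
     * Rough sketch: bump a timestamp on the primary, then read it back on the
     * standby.  Assumes a table created once on the primary, for example:
     *
     *   CREATE TABLE replication_check (id int PRIMARY KEY, ts timestamptz);
     *   INSERT INTO replication_check VALUES (1, now());
     *
     * Build with something like: cc check_replication.c -lpq
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <libpq-fe.h>

    static PGconn *
    connect_or_die(const char *conninfo)
    {
        PGconn *conn = PQconnectdb(conninfo);

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            exit(1);
        }
        return conn;
    }

    int
    main(void)
    {
        /* Placeholder connection strings; adjust host, dbname, user, and so on. */
        PGconn     *primary = connect_or_die("host=primary.example dbname=mydb");
        PGconn     *standby = connect_or_die("host=standby.example dbname=mydb");
        PGresult   *res;

        /* Write something small on the primary ... */
        res = PQexec(primary,
                     "UPDATE replication_check SET ts = now() WHERE id = 1");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "update failed: %s", PQerrorMessage(primary));
        PQclear(res);

        sleep(1);   /* give streaming replication a moment to catch up */

        /* ... and see whether it has arrived on the standby. */
        res = PQexec(standby, "SELECT ts FROM replication_check WHERE id = 1");
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
            printf("standby sees ts = %s\n", PQgetvalue(res, 0, 0));
        else
            fprintf(stderr, "query failed: %s", PQerrorMessage(standby));
        PQclear(res);

        PQfinish(primary);
        PQfinish(standby);
        return 0;
    }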

If you read "dbase_redo" in "src/backend/commands/dbcommands.c", you'll
see this:

    /*
     * Our theory for replaying a CREATE is to forcibly drop the target
     * subdirectory if present, then re-copy the source data. This may be
     * more work than needed, but it is simple to implement.
     */
    if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
    {
        if (!rmtree(dst_path, true))
            /* If this failed, copydir() below is going to error. */
            ereport(WARNING,
                    (errmsg("some useless files may be left behind in old database directory \"%s\"",
                            dst_path)));
    }

That explains this nicely.

> Since I’m not someone who says: "It’s okay, it works", I ended up running pg_basebackups after each
> promote due to these messages.

> I have been using PostgreSQL for ages. I use and love almost every feature of it.
> 
> The log messages, on the other hand, are more esoteric to me than the original Maya calendar
> that sank with Atlantis. Maybe I’m not the only user who, without reading the source code,
> would expect something like this:
> 
> LOG:  Everything is all right. I did what you asked, and I talk about it.
> NOTICE: Please fix your pg_hba.conf
> WARNING: If I weren’t rock solid ACID, you would be in trouble now. Don’t do this again.
> ERROR: The query was wrong but your data is safe.
> FATAL: Do you have a backup?
> PANIC: Do you have legal insurance?
> 
> But the reality in my understanding is something like this:
> 
> LOG: Your data is either corrupted or not.
> WARNING: I think something is wrong. (But it’s not)
> ERROR: Your data is either corrupted or not.
> FATAL: I’m the least harmful kind of log message in PostgreSQL. Just give me two seconds. Your data is safe.

The logic is like this:

LOG: This should go to the log so the reader knows what happened.
     No reason to worry.

NOTICE: Similar to log, but for the client, not the DBA.
        "Hey, I automatically created a sequence for you."

WARNING: Processing can continue, but this looks suspiciously
         like you made a mistake somewhere.
         (e.g., entering COMMIT when no transaction is active.)

ERROR: The current activity has to be interrupted, and the
       current transaction is dead.
       This could be a typo or data corruption.

FATAL: A database connection has to be terminated.
       It is fatal only for the connection, not for the database.
       (e.g., the WAL receiver complains that its friend on the
       other end has hung up, so it dies, having nothing more to do)

PANIC: An error in a critical section.
       There is no good way to recover from this, and shared memory state
       might be corrupted, so PostgreSQL reinitializes and performs crash recovery.


I have included the references to the source deliberately.

Since you say you like PostgreSQL and are someone who wants to understand what is
going on, your way will eventually lead to the source code.
Many of the error messages only make sense if you understand how PostgreSQL works,
e.g. that there is a WAL receiver process that cannot proceed if the master
server is shut down.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com