Thread: [HACKERS] Funny WAL corruption issue


[HACKERS] Funny WAL corruption issue

From: Chris Travers
Date:
Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal corruption and wanted to go over what I think happened and what I wonder about as a possible solution.

Basic information: custom-built PostgreSQL 9.6.3 on Gentoo, on a ~5TB database with variable load.  The master database has two slaves and generates 10-20MB of WAL traffic a second.  The data_checksums option is off.


The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with sys V init scripts).  On trying to restart, it gave me a nice "Invalid memory allocation request" error and promptly stopped.

The main logs showed a lot of messages like the following before the restart:

2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any process


On restart, the following was logged to stderr:

LOG:  entering standby mode
LOG:  redo starts at 1E39C/8B77B458
LOG:  consistent recovery state reached at 1E39C/E1117FF8
FATAL:  invalid memory alloc request size 3456458752
LOG:  startup process (PID 18167) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down


After some troubleshooting I found that the WAL segment had become corrupt. I copied the correct one from the master and everything came up to the present.

So it seems like somewhere something crashed big time on the back end, and when we tried to restart, the WAL ended in an invalid way.

I am wondering what can be done to prevent these sorts of things from happening in the future if, for example, a replica dies in the middle of a wal fsync. 
--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: [HACKERS] Funny WAL corruption issue

From: Vladimir Borodin
Date:
Hi, Chris.

On 10 Aug 2017, at 15:09, Chris Travers <chris.travers@gmail.com> wrote:

Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal corruption and wanted to go over what I think happened and what I wonder about as a possible solution.

Basic information: custom-built PostgreSQL 9.6.3 on Gentoo, on a ~5TB database with variable load.  The master database has two slaves and generates 10-20MB of WAL traffic a second.  The data_checksums option is off.


The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with sys V init scripts).  On trying to restart, it gave me a nice "Invalid memory allocation request" error and promptly stopped.

The main logs showed a lot of messages like the following before the restart:
2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any process

On restart, the following was logged to stderr:
LOG:  entering standby mode
LOG:  redo starts at 1E39C/8B77B458
LOG:  consistent recovery state reached at 1E39C/E1117FF8
FATAL:  invalid memory alloc request size 3456458752
LOG:  startup process (PID 18167) exited with exit code 1
LOG:  terminating any other active server processes

LOG:  database system is shut down

After some troubleshooting I found that the WAL segment had become corrupt. I copied the correct one from the master and everything came up to the present.

So it seems like somewhere something crashed big time on the back end, and when we tried to restart, the WAL ended in an invalid way.

We reported the same thing [1] nearly a year ago. Could you please check with pg_xlogdump that both WALs (the normal one from the master and the corrupted one) are exactly the same up to a certain LSN?
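
For anyone wanting to run the same check, here is a minimal sketch (the segment paths are hypothetical) that dumps both copies with pg_xlogdump and diffs the record-by-record output; pg_xlogdump is expected to stop with a FATAL error on the corrupt copy, so its stderr is printed as well:

import difflib
import subprocess

# Hypothetical paths to the two copies of the same WAL segment.
GOOD = "/path/to/master/archive/000000010001E39C000000E1"
BAD  = "/path/to/standby/pg_xlog/000000010001E39C000000E1"

def xlogdump(path):
    # pg_xlogdump is expected to stop with a FATAL error on the corrupt
    # copy, so don't raise on a non-zero exit status.
    proc = subprocess.run(["pg_xlogdump", path],
                          capture_output=True, text=True, check=False)
    return proc.stdout.splitlines(), proc.stderr.strip()

good_records, good_err = xlogdump(GOOD)
bad_records, bad_err = xlogdump(BAD)

# If both streams really are identical up to a certain LSN, the diff stays
# empty until the point where the corrupt copy stops producing records.
for line in difflib.unified_diff(good_records, bad_records,
                                 fromfile="master", tofile="standby",
                                 lineterm=""):
    print(line)

print("master  stderr:", good_err)
print("standby stderr:", bad_err)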



I am wondering what can be done to prevent these sorts of things from happening in the future if, for example, a replica dies in the middle of a wal fsync. 
--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.


--
May the force be with you…

Re: [HACKERS] Funny WAL corruption issue

From: Chris Travers
Date:
Sorry, meant to reply all.

On Thu, Aug 10, 2017 at 2:19 PM, Vladimir Borodin <root@simply.name> wrote:
Hi, Chris.

On 10 Aug 2017, at 15:09, Chris Travers <chris.travers@gmail.com> wrote:

Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal corruption and wanted to go over what I think happened and what I wonder about as a possible solution.

Basic information: custom-built PostgreSQL 9.6.3 on Gentoo, on a ~5TB database with variable load.  The master database has two slaves and generates 10-20MB of WAL traffic a second.  The data_checksums option is off.


The problem occurred when I attempted to restart the service on the slave using pg_ctl (I believe the service had been started with sys V init scripts).  On trying to restart, it gave me a nice "Invalid memory allocation request" error and promptly stopped.

The main logs showed a lot of messages like the following before the restart:
2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any process

On restart, the following was logged to stderr:
LOG:  entering standby mode
LOG:  redo starts at 1E39C/8B77B458
LOG:  consistent recovery state reached at 1E39C/E1117FF8
FATAL:  invalid memory alloc request size 3456458752
LOG:  startup process (PID 18167) exited with exit code 1
LOG:  terminating any other active server processes

LOG:  database system is shut down

After some troubleshooting I found that the WAL segment had become corrupt. I copied the correct one from the master and everything came up to the present.

So it seems like somewhere something crashed big time on the back end, and when we tried to restart, the WAL ended in an invalid way.

We reported the same thing [1] nearly a year ago. Could you please check with pg_xlogdump that both WALs (the normal one from the master and the corrupted one) are exactly the same up to a certain LSN?


Yes.  Exactly the same output until a certain point where pg_xlogdump dies with an error.  That is the same LSN where PostgreSQL died with an error on restart. 


I am wondering what can be done to prevent these sorts of things from happening in the future if, for example, a replica dies in the middle of a wal fsync. 
--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.


--
May the force be with you…




--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: [HACKERS] Funny WAL corruption issue

From: Chris Travers
Date:

Yes.  Exactly the same output until a certain point where pg_xlogdump dies with an error.  That is the same LSN where PostgreSQL died with an error on restart. 

One other thing that is possibly relevant after reading through the last bug report is the error pg_xlogdump gives:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset 1146880



--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: [HACKERS] Funny WAL corruption issue

From: Aleksander Alekseev
Date:
Hi Chris,

> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.

Sad story. Unfortunately, I have no idea what the reason could be, nor can I
suggest a good way to find it unless there is an already known sequence
of steps that reproduces the issue.

I just wanted to point out that a hardware issue or third-party software
issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1],
it's not as uncommon as most people think.

If the issue reproduces from time to time on one replica and doesn't on
a second, identical replica, there is a good chance that you've hit a
hardware issue. Another thing worth checking is the SMART status of the
hard drives.

[1]: http://www.pgcon.org/2017/schedule/attachments/453_corruption-pgcon-2017.pdf

--
Best regards,
Aleksander Alekseev

Re: [HACKERS] Funny WAL corruption issue

From: Vladimir Rusinov
Date:


On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
I just wanted to point out that a hardware issue or third-party software
issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1],
it's not as uncommon as most people think.

This still might be a case of hardware corruption, but it does not look like one.

The likelihood of two different people seeing a similar error message just a year apart is low. In our experience, hardware corruption usually looks like a random single bit flip (most commonly bad CPU or memory), a bunch of zeroes (bad storage), or a bunch of complete garbage (usually indicating in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its corrupt copy from the standby, can you do a bit-by-bit comparison to see how they differ? Also, if you can, please share some hardware details. Specifically, do you use ECC? If so, are there any ECC errors logged? Do you use physical disks/SSDs or some form of storage virtualization?
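
A minimal sketch of such a bit-by-bit comparison (file paths hypothetical; a WAL segment is only 16 MB by default, so it can simply be read in whole): it reports the first offset at which the two copies differ and prints a short hex window around it:

# Hypothetical paths to the two copies of the same WAL segment.
GOOD = "/path/to/master/segment"
BAD  = "/path/to/standby/segment"

with open(GOOD, "rb") as f:
    good = f.read()
with open(BAD, "rb") as f:
    bad = f.read()

if good == bad:
    print("segments are identical")
else:
    limit = min(len(good), len(bad))
    # First byte offset at which the copies differ (or the shorter length).
    first = next((i for i in range(limit) if good[i] != bad[i]), limit)
    print(f"first difference at file offset {first:#x}")
    start = max(first - 16, 0)
    print("master :", good[start:first + 16].hex())
    print("standby:", bad[start:first + 16].hex())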

Also, in the absolute majority of cases corruption is caught by checksums. I am not familiar with the WAL protocol - do we have enough checksums when writing it out and on the wire? I suspect there are many more things PostgreSQL could do to be more resilient, and at least detect corruption earlier.
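
On the checksum question: each WAL record header does carry a CRC-32C (xl_crc) over the record contents, and each WAL page header carries the address the page is supposed to be at, which is the field pg_xlogdump complained about above. A minimal sketch (hypothetical path; the 9.6 XLogRecord layout is assumed, and actual CRC verification is left out since it would need a CRC-32C implementation and contrecord reassembly) that just prints the record header fields at the failing LSN 1E39C/E1117FB8, i.e. offset 0x117FB8 within the segment:

import struct

SEGMENT = "/path/to/standby/segment"   # hypothetical path
RECORD_OFFSET = 0x117FB8               # 1E39C/E1117FB8 within this segment

# xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, 2 padding bytes, xl_crc
XLOG_RECORD = struct.Struct("<IIQBB2xI")

with open(SEGMENT, "rb") as f:
    f.seek(RECORD_OFFSET)
    tot_len, xid, prev, info, rmid, crc = XLOG_RECORD.unpack(f.read(XLOG_RECORD.size))

print(f"xl_tot_len = {tot_len}")
print(f"xl_xid     = {xid}")
print(f"xl_prev    = {prev >> 32:X}/{prev & 0xFFFFFFFF:08X}")
print(f"xl_info    = {info:#04x}, xl_rmid = {rmid}")
print(f"xl_crc     = {crc:#010x}")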

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047

Re: [HACKERS] Funny WAL corruption issue

From: Chris Travers
Date:


On Thu, Aug 10, 2017 at 3:17 PM, Vladimir Rusinov <vrusinov@google.com> wrote:


On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
I just wanted to point out that a hardware issue or third-party software
issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1],
it's not as uncommon as most people think.

This still might be a case of hardware corruption, but it does not look like one.

Yeah, I don't think so either.  The systems were not restarted, only the service, so I don't think this is a lie-on-write case.  We have ECC with full checks, etc.  It really looks like something I initiated caused it, but I'm not sure what, and I'm really not interested in trying to reproduce it on a db of this size. 

The likelihood of two different people seeing a similar error message just a year apart is low. In our experience, hardware corruption usually looks like a random single bit flip (most commonly bad CPU or memory), a bunch of zeroes (bad storage), or a bunch of complete garbage (usually indicating in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its corrupt copy from the standby, can you do a bit-by-bit comparison to see how they differ? Also, if you can, please share some hardware details. Specifically, do you use ECC? If so, are there any ECC errors logged? Do you use physical disks/SSDs or some form of storage virtualization?

Straight on bare metal, ECC with no errors logged.  SSD for both data and WAL.

The bitwise comparison is interesting.  Remember the error was:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset 1146880


Starting with the good WAL segment; I think the record starts at 003b:

0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 008c 0000 0000 0000
0118000 d093 0005 0001 0000 8000 e111 e39c 0001
0118010 0084 0000 0000 0000 7fb8 e111 e39c 0001
0118020 0910 0000 ccac 2eba 2000 0056 067f 0000
0118030 4000 0000 2249 0195 b5c4 0000 08ff 0001
0118040 0002 0003 0004 0005 0006 0007 0008 0009
0118050 000a 000b 000c 000d 000e 000f 0010 0011
0118060 0012 0013 0014 0015 0016 0017 0018 0019
0118070 001a 001b 001c 001d 001e 001f 0020 0021



And the corrupted segment from the standby:

0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 4079 ce05 1cce ecf9
0118000 d093 0005 0001 0000 8000 6111 e375 0001
0118010 119d 0000 0000 0000 cfd4 00cc ca00 0410
0118020 1800 7c00 5923 544b dc20 914c 7a5c afec
0118030 db45 0060 b700 1910 1800 7c00 791f 2ede
0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210
0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef
0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355
0118070 d432 8365 de99 4dba 87c7 00ed 6200 2210

I think the divergence is interesting here.  Up through 0117ff8, they are identical.  Then the last half of the line differs.
The first half of the next line is the same (but up through 011800a this time), but the last 6 bytes differ (those six hold what appear to be the memory address causing the problem), and we only have a few bits different in the rest of the line.

It looks like some data and some flags were overwritten, perhaps while the process exited.  Very interesting.
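
Those differing words at 0118000 are the WAL page header itself, and they explain the "unexpected pageaddr" message: the fourth field of the header is the address the page claims to belong to. A minimal sketch (paths hypothetical; the 9.6 short page header layout xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len is assumed) that decodes that header in both copies; per the dumps above, the good copy should report 1E39C/E1118000 and the corrupt copy 1E375/61118000, which is exactly what pg_xlogdump complained about:

import struct

PAGE_OFFSET = 0x118000   # the 8 kB page boundary right after the divergence
# Short page header: xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len
PAGE_HEADER = struct.Struct("<HHIQI")

COPIES = {
    "master ": "/path/to/master/segment",    # hypothetical paths
    "standby": "/path/to/standby/segment",
}

for label, path in COPIES.items():
    with open(path, "rb") as f:
        f.seek(PAGE_OFFSET)
        magic, info, tli, pageaddr, rem_len = PAGE_HEADER.unpack(f.read(PAGE_HEADER.size))
    print(f"{label}: xlp_magic={magic:#06x} xlp_tli={tli} "
          f"xlp_pageaddr={pageaddr >> 32:X}/{pageaddr & 0xFFFFFFFF:08X} "
          f"xlp_rem_len={rem_len}")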


Also, in the absolute majority of cases corruption is caught by checksums. I am not familiar with the WAL protocol - do we have enough checksums when writing it out and on the wire? I suspect there are many more things PostgreSQL could do to be more resilient, and at least detect corruption earlier.

Since this didn't throw a checksum error (we have data checksums disabled but wal records ISTR have a separate CRC check), would this perhaps indicate that the checksum operated over incorrect data?

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047




--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: [HACKERS] Funny WAL corruption issue

From: Greg Stark
Date:
On 10 August 2017 at 15:26, Chris Travers <chris.travers@gmail.com> wrote:
>
>
> The bitwise comparison is interesting.  Remember the error was:
>
> pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> 1146880
...
> Since this didn't throw a checksum error (we have data checksums disabled but wal records ISTR have a separate CRC
> check), would this perhaps indicate that the checksum operated over incorrect data?

No checksum error and this "unexpected pageaddr" doesn't necessarily
mean data corruption. It could mean that when the database stopped logging
it was reusing a wal file and the old wal stream had a record boundary
on the same byte position. So the previous record checksum passed and
the following record checksum passes but the record header is for a
different wal stream position.

I think you could actually hack xlogdump to ignore this condition and
keep outputting and you'll see whether the records that follow appear
to be old wal log data.  I haven't actually tried this though.
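
Short of hacking pg_xlogdump, one way to test this recycled-segment theory is to walk the pages after the failure point in the corrupt copy and look at the address each page header claims (a minimal sketch: path hypothetical, default 8 kB WAL block size and the 9.6 short page header layout assumed). If the tail is just an old segment's contents left over from recycling, the page addresses should all point at an older WAL position (around 1E375/... here) rather than 1E39C/...:

import struct

SEGMENT = "/path/to/standby/segment"   # hypothetical path to the corrupt copy
XLOG_BLCKSZ = 8192                     # default WAL block size
FIRST_BAD_PAGE = 0x118000              # page where pg_xlogdump gave up
# Short page header: xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len
PAGE_HEADER = struct.Struct("<HHIQI")

with open(SEGMENT, "rb") as f:
    # The first 16 pages after the failure are enough to see the pattern.
    for off in range(FIRST_BAD_PAGE, FIRST_BAD_PAGE + 16 * XLOG_BLCKSZ, XLOG_BLCKSZ):
        f.seek(off)
        magic, info, tli, pageaddr, rem_len = PAGE_HEADER.unpack(f.read(PAGE_HEADER.size))
        print(f"offset {off:#x}: xlp_pageaddr {pageaddr >> 32:X}/{pageaddr & 0xFFFFFFFF:08X}")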

-- 
greg



Re: [HACKERS] Funny WAL corruption issue

From: Chris Travers
Date:


On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark <stark@mit.edu> wrote:
On 10 August 2017 at 15:26, Chris Travers <chris.travers@gmail.com> wrote:
>
>
> The bitwise comparison is interesting.  Remember the error was:
>
> pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> 1146880
...
> Since this didn't throw a checksum error (we have data checksums disabled but wal records ISTR have a separate CRC check), would this perhaps indicate that the checksum operated over incorrect data?

No checksum error and this "unexpected pageaddr" doesn't necessarily
mean data corruption. It could mean that when the database stopped logging
it was reusing a wal file and the old wal stream had a record boundary
on the same byte position. So the previous record checksum passed and
the following record checksum passes but the record header is for a
different wal stream position.

I expect to test this theory shortly.

Assuming it is correct, what can we do to prevent restarts of slaves from running into it? 


I think you could actually hack xlogdump to ignore this condition and
keep outputting and you'll see whether the records that follow appear
to be old wal log data.  I haven't actually tried this though.

--
greg



--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: [HACKERS] Funny WAL corruption issue

From: Chris Travers
Date:


On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark <stark@mit.edu> wrote:
On 10 August 2017 at 15:26, Chris Travers <chris.travers@gmail.com> wrote:
>
>
> The bitwise comparison is interesting.  Remember the error was:
>
> pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> 1146880
...
> Since this didn't throw a checksum error (we have data checksums disabled but wal records ISTR have a separate CRC check), would this perhaps indicate that the checksum operated over incorrect data?

No checksum error and this "unexpected pageaddr" doesn't necessarily
mean data corruption. It could mean that when the database stopped logging
it was reusing a wal file and the old wal stream had a record boundary
on the same byte position. So the previous record checksum passed and
the following record checksum passes but the record header is for a
different wal stream position.

I think you could actually hack xlogdump to ignore this condition and
keep outputting and you'll see whether the records that follow appear
to be old wal log data.  I haven't actually tried this though.

For better or worse, I get a different error at the same spot if I try this (doing so involved disabling the check in the backend WAL reader):

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: invalid contrecord length 4509 at 1E39C/E1117FF8

If I hack it to ignore all errors on that record, no further records come out, though it does run over the same records.

This leads me to conclude there are no further valid records.
 

--
greg



--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.