On 04/17/2018 10:55 AM, Emre Hasegeli wrote:
>> Can you check if the "incorrect" part of the WAL segment matches some
>> previous segment? Verifying that shouldn't be very difficult (just cut a
>> bunch of bytes using hexdump, compare to the incorrect data). Assuming
>> you still have the WAL archive, of course. That would tell us that the
>> corrupted part comes from an old recycled segment.
>
> I had found and saved the recycled WAL file from the archive after the
> incident. Here is the hexdump of it at the same position:
>
> 0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
> 0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
> 0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
> 0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
> 0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
> 0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
> 0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
> 0bde030 0765 7375 6111 646e 6f72 6469 370d 312e
>
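
The line at 0bde000 looks like the start of a WAL page, so the header
can be decoded straight from the dump. A rough sketch, assuming the
default hexdump output (16-bit words in little-endian order), the usual
short page header layout (magic, info, timeline, page address, remaining
length) and the default 16MB segment size. The function names are made
up for illustration:

```python
import struct

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16MB segments


def hexdump_words_to_bytes(line):
    """Turn one line of default `hexdump` output back into raw bytes.

    Assumes an offset column followed by 16-bit words that were printed
    in little-endian host order.
    """
    words = line.split()[1:]  # drop the leading offset column
    return b"".join(struct.pack("<H", int(w, 16)) for w in words)


def parse_page_header(raw):
    """Unpack the first 20 bytes as a short WAL page header (assumed layout)."""
    magic, info, tli, pageaddr, rem_len = struct.unpack_from("<HHIQI", raw)
    return {
        "magic": magic,
        "info": info,
        "timeline": tli,
        # print the page address the way LSNs are usually shown
        "lsn": "%X/%X" % (pageaddr >> 32, pageaddr & 0xFFFFFFFF),
        "segment_offset": pageaddr % WAL_SEGMENT_SIZE,
        "rem_len": rem_len,
    }


# The two lines following the page boundary in the recycled file.
lines = [
    "0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000",
    "0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34",
]
raw = b"".join(hexdump_words_to_bytes(l) for l in lines)
header = parse_page_header(raw)
print(header)
```

Under those assumptions the page address comes out as 1DFB/66BDE000 on
timeline 5, and its offset within the segment is 0xbde000, which agrees
with the hexdump offset. Comparing that address with the LSN the standby
expected at this position should show how old the recycled page is.
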
> If you compare it with the other 2 I have posted, you would notice
> that the corrupted file on the standby is a combination of the two. The
> data on it starts with the data on the master, and continues with the
> data of the recycled file. The switch is at the position 0bddff8
> which is the position printed as "Minimum recovery ending location" by
> pg_controldata.
>
OK, this seems to confirm the theory that there's a race condition
between segment recycling and replication. It's likely limited to a short
period after a crash, otherwise we'd probably see many more reports.
But it's still just a hunch - someone needs to read through the code and
check how it behaves in these situations. Not sure when I'll have time
for that.
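
For anyone who wants to repeat the comparison on their own files, here
is a minimal sketch of how it could be scripted. It assumes the three
segments (standby, master, recycled) have been read into memory as bytes,
e.g. with open(path, "rb").read(). The function name is hypothetical,
and this is not something that exists in PostgreSQL:

```python
def find_switch_offset(standby, master, recycled):
    """Return the first offset where the standby copy stops matching the
    master copy, provided the rest of the compared range matches the
    recycled segment. Returns None if no such offset exists.
    """
    n = min(len(standby), len(master), len(recycled))
    for i in range(n):
        if standby[i] != master[i]:
            # From here on the standby should look like the recycled file.
            return i if standby[i:n] == recycled[i:n] else None
    return None  # standby is identical to master over the compared range


# Tiny synthetic example: first 5 bytes from "master", rest from "recycled".
master = b"MMMMMMMM"
recycled = b"rrrrrrrr"
standby = master[:5] + recycled[5:]
switch = find_switch_offset(standby, master, recycled)
print(hex(switch))
```

If the master and the recycled file happen to share bytes right at the
boundary, the reported offset will be somewhat later than the real
switch, so it is worth checking the result against the "Minimum recovery
ending location" from pg_controldata.
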

>> Hmmm, I see you're using SSL. I don't think that could affect
>> anything, but maybe I should try mimicking this aspect too.
>
> This is the connection information. Note that the master shows SSL
> compression as disabled despite it being explicitly requested.
>
>> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'

Hmmm, that seems like a separate issue. When you say 'master shows SSL
compression is disabled' where do you see that?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services