Re: Standby corruption after master is restarted

From Tomas Vondra
Subject Re: Standby corruption after master is restarted
Msg-id 1da55c73-4bd1-f13e-2d4b-c4049ffd73f5@2ndquadrant.com
In response to Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Responses Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
List pgsql-bugs

On 04/17/2018 10:55 AM, Emre Hasegeli wrote:
>> Can you check if the "incorrect" part of the WAL segment matches some
>> previous segment? Verifying that shouldn't be very difficult (just cut a
>> bunch of bytes using hexdump, compare to the incorrect data). Assuming
>> you still have the WAL archive, of course. That would tell us that the
>> corrupted part comes from an old recycled segment.
> 
> I had found and saved the recycled WAL file from the archive after the
> incident.  Here is the hexdump of it at the same position:
> 
> 0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
> 0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
> 0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
> 0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
> 0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
> 0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
> 0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
> 0bde030 0765 7375 6111 646e 6f72 6469 370d 312e
> 
> If you compare it with the other 2 I have posted, you would notice
> that the corrupted file on standby is a combination of the two.  The
> data on it starts with the data on the master, and continues with the
> data of the recycled file.  The switch is at the position 0bddff8
> which is the position printed as "Minimum recovery ending location" by
> pg_controldata.
> 

OK, this seems to confirm the theory that there's a race condition 
between segment recycling and replication. It's likely limited to a short 
period after a crash; otherwise we'd probably see many more reports.

But it's still just a hunch - someone needs to read through the code and 
check how it behaves in these situations. Not sure when I'll have time 
for that.
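
For anyone who wants to repeat the comparison described above without
eyeballing hexdump output, here is a minimal sketch in Python. It assumes
you have three copies of the same segment position on disk (the corrupted
one from the standby, the good one from the master, and the old recycled
one from the archive). The function names and the 64-byte window are
made up for illustration; nothing here comes from PostgreSQL itself.

```python
from typing import Optional, Tuple


def first_divergence(a: bytes, b: bytes, start: int = 0) -> Optional[int]:
    """Return the first offset >= start at which a and b differ.

    Only the common length is compared; None means no difference was found.
    """
    for offset in range(start, min(len(a), len(b))):
        if a[offset] != b[offset]:
            return offset
    return None


def locate_switch(standby: bytes, master: bytes, recycled: bytes,
                  window: int = 64) -> Tuple[Optional[int], bool]:
    """Find where the standby copy stops matching the master copy and
    report whether the next `window` bytes match the recycled segment."""
    offset = first_divergence(standby, master)
    if offset is None:
        # The standby copy matches the master over the whole common length.
        return None, False
    tail = standby[offset:offset + window]
    return offset, tail == recycled[offset:offset + window]
```

Each file can be loaded with pathlib.Path(...).read_bytes(). A result of
(0xbddff8, True) would agree with what Emre describes, although the
reported offset can land a few bytes late if the master and recycled data
happen to be identical right at the switch point.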

>> Hmmm, I see you're using SSL. I don't think that could affect
>> anything, but maybe I should try mimicking this aspect too.
> 
> This is the connection information.  However, the master shows SSL
> compression is disabled despite being explicitly asked for.
> 
>> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'

Hmmm, that seems like a separate issue. When you say 'master shows SSL 
compression is disabled', where do you see that?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

