Fwd: Data corruption after restarting replica

From	Novák, Petr
Subject	Fwd: Data corruption after restarting replica
Date	16 February 2015, 19:48:56
Msg-id	CA+eEC0q=GsQ+KKHwknoLE1rdmOAXHM6Eyf1j7t0L3iFk03yATw@mail.gmail.com
Replies	Re: Fwd: Data corruption after restarting replica (Adrian Klaver <adrian.klaver@aklaver.com>); Re: Fwd: Data corruption after restarting replica (dinesh kumar <dineshkumar02@gmail.com>); Re: Fwd: Data corruption after restarting replica (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List	pgsql-general

Hello,

Sorry for posting to a second list, but as I've received no reply
on pgsql-bugs, I'm trying my luck here.

Thanks
Petr


---------- Forwarded message ----------
From: Novák, Petr <novakp@avast.com>
Date: Tue, Feb 10, 2015 at 12:49 PM
Subject: Data corruption after restarting replica
To: pgsql-bugs@postgresql.org


Hi all,

we're experiencing data corruption after switching a streaming replica to primary.
This is not the first time I've encountered this issue, so I'll try to
describe it in more detail.

For this particular cluster we have 6 servers in two datacenters (3 in
each). There are two instances running on each server, each with its
own port and datadir. On the first two servers in each datacenter, one
instance is a primary and the other is a replica of the primary on the
other server. The third server holds two offsite replicas from the other
datacenter (for DR purposes).

Each replica was set up by taking a pg_basebackup from the primary
(pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U
<user> -x -c fast). Then the directories from initdb were replaced with
the ones from the basebackup (only the configuration files remained), and
the replica was started and connected to the primary successfully. It
ran with no problems, keeping up with the primary. We were
experiencing some connection problems between the two datacenters, but
replication didn't break.
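
For reference, the pg_basebackup invocation above can be written down as an argument list with each flag annotated. This is only a sketch of the command quoted in the message, not a tested provisioning script. The function name `basebackup_command` and the example host and user values are hypothetical, because the original elides them as <hostname> and <user>.

```python
def basebackup_command(host, user, port=5430, target="/data2/basebackup"):
    """Build the pg_basebackup argument list quoted in the message."""
    return [
        "pg_basebackup",
        "-h", host,        # primary to copy from
        "-p", str(port),   # port of the primary instance
        "-D", target,      # directory that receives the base backup
        "-P",              # show progress
        "-v",              # verbose output
        "-U", user,        # user to connect as
        "-x",              # include the required WAL files in the backup
        "-c", "fast",      # request an immediate (fast) checkpoint
    ]

# Hypothetical values; substitute the real host and user.
cmd = basebackup_command("primary.example", "replicator")
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` unchanged, which avoids shell quoting issues.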

Then we needed to take one datacenter offline for hardware
maintenance. So I shut the applications down, verified that no
more clients were connected to the primary, then shut the primary down and
restarted the replica without recovery.conf. The applications were
started using the new db with no problem. The other replica even
successfully reconnected to this new primary.

A few hours after the switch, lines that had not appeared before
showed up in the server log, indicating corruption:

ERROR:  index "account_username_key" contains unexpected zero page at
block 1112135
ERROR:  right sibling's left-link doesn't match: block 476354 links to
1062443 instead of expected 250322 in index "account_pkey"

...and many more, reporting corruption in several other indexes.
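
To get a list of every affected index out of a long server log, the messages can be scanned mechanically. The following is a minimal sketch that assumes each error sits on a single line in the real log (the quotes above are wrapped by the mail client) and that covers only the two message shapes quoted here. The function name `corrupted_indexes` is hypothetical.

```python
import re

# Patterns for the two messages quoted above; other corruption messages
# exist and are not covered by this sketch.
PATTERNS = [
    re.compile(r'index "([^"]+)" contains unexpected zero page'),
    re.compile(r"right sibling's left-link doesn't match: .* in index \"([^\"]+)\""),
]

def corrupted_indexes(log_lines):
    """Return the sorted, de-duplicated index names found in matching lines."""
    found = set()
    for line in log_lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                found.add(match.group(1))
    return sorted(found)

sample = [
    'ERROR:  index "account_username_key" contains unexpected zero page at block 1112135',
    "ERROR:  right sibling's left-link doesn't match: block 476354 links to "
    '1062443 instead of expected 250322 in index "account_pkey"',
    "LOG:  checkpoint starting: time",
]
print(corrupted_indexes(sample))  # ['account_pkey', 'account_username_key']
```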

The issue was resolved by creating new indexes and dropping the
affected ones, although there were already some duplicates in the
data that had to be resolved, as some of the indexes were unique.
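
Duplicates of this kind can be located with a GROUP BY ... HAVING query before a unique index is rebuilt. The sketch below only builds the SQL text. The table name `account` and column name `username` are assumptions inferred from the index name "account_username_key"; the original message does not state them. The helper `duplicate_check_sql` is hypothetical.

```python
def duplicate_check_sql(table, columns):
    """Build a query listing key values that occur more than once."""
    # Double-quote identifiers, doubling any embedded quote characters.
    cols = ", ".join('"%s"' % c.replace('"', '""') for c in columns)
    tbl = '"%s"' % table.replace('"', '""')
    return (
        "SELECT %s, count(*) AS copies FROM %s "
        "GROUP BY %s HAVING count(*) > 1;" % (cols, tbl, cols)
    )

# Assumed table and column names, see the note above.
print(duplicate_check_sql("account", ["username"]))
```

It may be prudent to run such a query with `enable_indexscan` and `enable_bitmapscan` turned off for the session, so that the result does not depend on the damaged index. That is a suggestion, not a step taken from the original report.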

This particular case uses Postgres 9.1.14 on both primary and replica,
but I've experienced similar behavior on 9.2.9. The OS is CentOS 6.6 in all
cases. This may mean that there is something wrong with our
configuration or the replication setup steps, but I've set up another
instance using the same steps with no problem.

Fsync-related settings are at their defaults. Data directories are on
RAID10 arrays with BBUs. The filesystem is ext4, mounted with the nobarrier
option.
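
When reporting a setup like this, it can help to record the mount options exactly as the kernel shows them. Here is a minimal sketch that parses text in the /proc/mounts format (device, mount point, filesystem type, options, and two trailing fields). The sample content, the device names and the /data2 mount point are made up for illustration, and the function name `mount_options` is hypothetical. The check accepts both `nobarrier` and `barrier=0`, since either spelling may be shown depending on the kernel.

```python
def mount_options(mounts_text, mount_point):
    """Return (fstype, set_of_options) for mount_point, or None if absent."""
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point overrides an earlier one.
            result = (fields[2], set(fields[3].split(",")))
    return result

# Hypothetical /proc/mounts content; devices and mount point are made up.
sample_mounts = (
    "/dev/sda1 / ext4 rw,relatime,barrier=1,data=ordered 0 0\n"
    "/dev/sdb1 /data2 ext4 rw,noatime,nobarrier,data=ordered 0 0\n"
)
fstype, opts = mount_options(sample_mounts, "/data2")
barriers_off = "nobarrier" in opts or "barrier=0" in opts
print(fstype, barriers_off)  # ext4 True
```

On a real host the text would come from reading /proc/mounts instead of the sample string.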

The database is fairly large (~120GB), with several tables of 50 million+
rows and lots of indexes and FK constraints. It is mostly read;
updates/inserts/deletes amount to only several rows/s.

Any help will be appreciated.

Petr Novak

System Engineer
Avast s.r.o.
